All Projects → PhantomInsights → Mexican Government Report

PhantomInsights / Mexican Government Report

Licence: mit
Text Mining on the 2019 Mexican Government Report, covering from extracting text from a PDF file to plotting the results.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Mexican Government Report

Data Science Ipython Notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Stars: ✭ 22,048 (+4561.31%)
Mutual labels:  pandas, numpy, matplotlib
Ai Learn
人工智能学习路线图,整理近200个实战案例与项目,免费提供配套教材,零基础入门,就业实战!包括:Python,数学,机器学习,数据分析,深度学习,计算机视觉,自然语言处理,PyTorch tensorflow machine-learning,deep-learning data-analysis data-mining mathematics data-science artificial-intelligence python tensorflow tensorflow2 caffe keras pytorch algorithm numpy pandas matplotlib seaborn nlp cv等热门领域
Stars: ✭ 4,387 (+827.48%)
Mutual labels:  pandas, numpy, matplotlib
Python-for-data-analysis
No description or website provided.
Stars: ✭ 18 (-96.19%)
Mutual labels:  numpy, pandas, matplotlib
Engezny
Engezny is a python package that quickly generates all possible charts from your dataframe and saves them for you, and engezny is only supporting now uni-parameter visualization using the pie, bar and barh visualizations.
Stars: ✭ 25 (-94.71%)
Mutual labels:  numpy, pandas, matplotlib
Subreddit Analyzer
A comprehensive Data and Text Mining workflow for submissions and comments from any given public subreddit.
Stars: ✭ 447 (-5.5%)
Mutual labels:  pandas, spacy, matplotlib
datascienv
datascienv is package that helps you to setup your environment in single line of code with all dependency and it is also include pyforest that provide single line of import all required ml libraries
Stars: ✭ 53 (-88.79%)
Mutual labels:  numpy, pandas, matplotlib
Python for data analysis 2nd chinese version
《利用Python进行数据分析·第2版》
Stars: ✭ 4,049 (+756.03%)
Mutual labels:  pandas, numpy, matplotlib
Windrose
A Python Matplotlib, Numpy library to manage wind data, draw windrose (also known as a polar rose plot), draw probability density function and fit Weibull distribution
Stars: ✭ 208 (-56.03%)
Mutual labels:  pandas, numpy, matplotlib
Python-Matematica
Explorando aspectos fundamentais da matemática com Python e Jupyter
Stars: ✭ 41 (-91.33%)
Mutual labels:  numpy, pandas, matplotlib
data-analysis-using-python
Data Analysis Using Python: A Beginner’s Guide Featuring NYC Open Data
Stars: ✭ 81 (-82.88%)
Mutual labels:  numpy, pandas, matplotlib
covid-19
Data ETL & Analysis on the global and Mexican datasets of the COVID-19 pandemic.
Stars: ✭ 14 (-97.04%)
Mutual labels:  numpy, pandas, matplotlib
Stats Maths With Python
General statistics, mathematical programming, and numerical/scientific computing scripts and notebooks in Python
Stars: ✭ 381 (-19.45%)
Mutual labels:  pandas, numpy, matplotlib
Udacity-Data-Analyst-Nanodegree
Repository for the projects needed to complete the Data Analyst Nanodegree.
Stars: ✭ 31 (-93.45%)
Mutual labels:  numpy, pandas, matplotlib
introduction to ml with python
도서 "[개정판] 파이썬 라이브러리를 활용한 머신 러닝"의 주피터 노트북과 코드입니다.
Stars: ✭ 211 (-55.39%)
Mutual labels:  numpy, pandas, matplotlib
Python Wechat Itchat
微信机器人,基于Python itchat接口功能实例展示:01-itchat获取微信好友或者微信群分享文章、02-itchat获取微信公众号文章、03-itchat监听微信公众号发送的文章、04 itchat监听微信群或好友撤回的消息、05 itchat获得微信好友信息以及表图对比、06 python打印出微信被删除好友、07 itchat自动回复好友、08 itchat微信好友个性签名词云图、09 itchat微信好友性别比例、10 微信群或微信好友撤回消息拦截、11 itchat微信群或好友之间转发消息
Stars: ✭ 216 (-54.33%)
Mutual labels:  pandas, numpy, matplotlib
Dimensionality-reduction-and-classification-on-Hyperspectral-Images-Using-Python
In this repository, You can find the files which implement dimensionality reduction on the hyperspectral image(Indian Pines) with classification.
Stars: ✭ 63 (-86.68%)
Mutual labels:  numpy, pandas, matplotlib
Machine Learning With Python
Practice and tutorial-style notebooks covering wide variety of machine learning techniques
Stars: ✭ 2,197 (+364.48%)
Mutual labels:  pandas, numpy, matplotlib
Data Science Types
Mypy stubs, i.e., type information, for numpy, pandas and matplotlib
Stars: ✭ 180 (-61.95%)
Mutual labels:  pandas, numpy, matplotlib
Exploratory Data Analysis Visualization Python
Data analysis and visualization with PyData ecosystem: Pandas, Matplotlib Numpy, and Seaborn
Stars: ✭ 78 (-83.51%)
Mutual labels:  numpy, pandas, matplotlib
The-Data-Visualization-Workshop
A New, Interactive Approach to Learning Data Visualization
Stars: ✭ 59 (-87.53%)
Mutual labels:  numpy, pandas, matplotlib

Mexican Government Report Text Analysis

This repository documents the process of extracting text from a PDF, cleaning it, passing it through an NLP pipeline, and presenting the results with graphs.

The PDF is the government report of 2019 that was released on September 1st. The PDF is in the data folder.

Requirements

This project uses the following Python libraries

  • PyPDF2 : For extracting text from PDF files.
  • spaCy : For passing the extracted text into an NLP pipeline.
  • NumPy : For fast matrix operations.
  • pandas : For analysing and getting insights from datasets.
  • matplotlib : For creating graphs and plots.
  • seaborn : For enhancing the style of matplotlib plots.
  • geopandas : For plotting maps.

If you are on Windows I strongly recommend that you use Anaconda and install geopandas using these commands:

conda install -c conda-forge geopandas
conda install -c conda-forge descartes

PDF Extraction

The government report that we will process can be downloaded from the following url: https://www.gob.mx/primerinforme

For your convenience I have added the PDF file in the data folder.

Extracting text from PDF files is a bit unreliable, you lose the original formatting and there are times you don't get any text at all.

Fortunately, this PDF file is simple enough that extracting the text wasn't that complex but I encountered some challenges that I will go in more detail in this section.

For this task, we will use PyPDF2 which is a well-known PDF library.

reader = PyPDF2.PdfFileReader("informe.pdf")
full_text = ""

The page numbers in the PDF are not the same as the reported number of pages, we use this variable to keep track of both.

pdf_page_number = 3

We will only retrieve the first 3 sections of the government report which are between pages 14 and 326. The reason for this is that only the 3 first sections contain a substantial amount of text.

for i in range(14, 327):
    # This block is used to remove the page number at the start of
    # each page. The first if removes page numbers with one digit.
    # The second if removes page numbers with 2 digits and the else
    # statement removes page numbers with 3 digits.
    if pdf_page_number <= 9:
        page_text = reader.getPage(i).extractText().strip()[1:]
    elif pdf_page_number >= 10 and pdf_page_number <= 99:
        page_text = reader.getPage(i).extractText().strip()[2:]
    else:
        page_text = reader.getPage(i).extractText().strip()[3:]

    full_text += page_text.replace("\n", "")
    pdf_page_number += 1

The previous code block ensures that we get the cleanest possible output. It removes page numbers and builds us the full transcript to a single string.

There's a small issue when decoding the PDF file. We will manually fix all the weird characters with their correct equivalents.

for item, replacement in CHARACTERS.items():
    full_text = full_text.replace(item, replacement)

This actually took a few hours to complete, I had to manually map all weird characters with their correct equivalents, some examples are:

CHARACTERS = {
    "ç": "Á",
    "⁄": "á",
    "…": "É",
    "”": "é",
    "ê": "Í",
    "™": 'í',
    "î": "Ó"
}

We will remove all extra white spaces and finally we save the cleaned text into a .txt file.

# This looks weird but that's the most practical way to remove double to quad white spaces.
full_text = full_text.replace("  ", " ").replace(
    "  ", " ").replace("  ", " ")

with open("transcript_clean.txt", "w", encoding="utf-8") as temp_file:
    temp_file.write(full_text)

With the transcript cleaned and well encoded we are ready to run it through a NLP pipeline.

NLP Pipeline

For this step we will make use of spaCy, this library is very well built and easy to use.

After installing it with pip you must install the Spanish model. You can do it by running the following command on your CMD or Terminal:

python -m spacy download es_core_news_md

At this point I made the decision to save the pipeline results to csv files. This way the csv files can be loaded into any statistical toolkit.

We start by preparing our corpus and loading the Spanish model.

corpus = open("transcript_clean.txt", "r", encoding="utf-8").read()
nlp = spacy.load("es_core_news_md")

# Our corpus is bigger than the default limit, we will set
# a new limit equal to its length.
nlp.max_length = len(corpus)

doc = nlp(corpus)

Note: This script takes a few minutes to complete, you will also need a few gigabytes of RAM to properly run it, feel free to skip it and use the csv files in the data folder.

Once everything is loaded, we will just save the results into csv files.

Text Tokens

The code is pretty straight forward, we initialize a list with a header row.

Then we iterate over the doc object and add the token text, lemma, part-of-speech and other properties to our data list and save it to csv using csv.writer().

data_list = [["text", "text_lower", "lemma", "lemma_lower",
                "part_of_speech", "is_alphabet", "is_stopword"]]

for token in doc:
    data_list.append([token.text, token.lower_, token.lemma_, token.lemma_.lower(), token.pos_, token.is_alpha, token.is_stop])

csv.writer(open("./tokens.csv", "w", encoding="utf-8",
                newline="")).writerows(data_list)

It is important to note that a token is not always a word, it can be a punctuation mark or a number. That's why we also save the is_alpha property so we can later filter out non-alphabetic tokens.

We also saved the lowercase form of all tokens and lemmas so we don't have to do it every time we want to process them with pandas.

Text Entities

This is almost the same as the previous code block, the main difference is that we iterate over the doc.ents object.

data_list = [["text", "text_lower", "label"]]

for ent in doc.ents:
    data_list.append([ent.text, ent.lower_, ent.label_])

csv.writer(open("./entities.csv", "w", encoding="utf-8",
                newline="")).writerows(data_list)

Entities refer to real world persons, organizations or locations. Since the model was trained from Wikipedia we will get a broad amount of them.

Text Sentences

This function is a bit different than the previous ones, in this part I counted the number of positive and negative words per sentence.

I used the following dataset of positive and negative words:

https://www.kaggle.com/rtatman/sentiment-lexicons-for-81-languages

To download it you will require to register to Kaggle (free). And from the zip file you will require 2 txt files:

negative_words_es.txt
positive_words_es.txt

Once you have the required files in your workspace we will load them into memory and turn their contents into lists.

with open("positive_words_es.txt", "r", encoding="utf-8") as temp_file:
    positive_words = temp_file.read().splitlines()

with open("negative_words_es.txt", "r", encoding="utf-8") as temp_file:
    negative_words = temp_file.read().splitlines()

Now we will iterate over thedoc.sents object and keep a score for each sentence and save the results to csv.

data_list = [["text", "score"]]

for sent in doc.sents:

    # Only take into account real sentences.
    if len(sent.text) > 10:

        score = 0

        # Start scoring the sentence.
        for word in sent:

            if word.lower_ in positive_words:
                score += 1

            if word.lower_ in negative_words:
                score -= 1

        data_list.append([sent.text, score])

csv.writer(open("./sentences.csv", "w", encoding="utf-8",
                newline="")).writerows(data_list)

After running the 3 functions we will have 3 csv files ready to be analyzed with pandas and plotted with matplotlib.

Plotting the Data

For creating the plots we will use seaborn and matplotlib, the reason for this is that seaborn applies some subtle yet nice looking effects to the plots.

In this project we are going to use bar plots (1 horizontal and 1 vertical) which are very helpful for comparing different values of the same category.

To plot the map we will use geopandas and a shape file of México.

The first thing to do is to set some custom colors that will apply globally to each plot.

sns.set(style="ticks",
    rc={
        "figure.figsize": [12, 7],
        "text.color": "white",
        "axes.labelcolor": "white",
        "axes.edgecolor": "white",
        "xtick.color": "white",
        "ytick.color": "white",
        "axes.facecolor": "#5C0E10",
        "figure.facecolor": "#5C0E10"}
    )

With our style declared we are ready to plot our data.

Most Used Words

We will start by loading the tokens csv file into a DataFrame.

df = pd.read_csv("./data/tokens.csv")

There's a small issue with the lemma of programas (programs), we fix it by grouping together programa and programar.

df.loc[df["lemma_lower"] == "programa", "lemma_lower"] = "programar"

We will only take into account the top 20 alphabet tokens that are longer than 1 character and are not stop words.

words = df[(df["is_alphabet"] == True) & (df["is_stopword"] == False) & (
    df["lemma_lower"].str.len() > 1)]["lemma_lower"].value_counts()[:20]

We will use a seaborn barplot, which allows us to apply a gradient to the color values without any effort.

sns.barplot(x=words.values, y=words.index, palette="Blues_d", linewidth=0)

We add the final customizations.

plt.xlabel("Ocurrences Count")
plt.title("Most Frequent Words")
plt.show()

Most Used Words

In the table below I have added the lemmas, their counts and their closest English equivalents.

Lemma (Spanish) Lemma (English) Total Count
nacional national 806
junio June 783
méxico Mexico 646
programar program 584
millón million 564
peso Mexican Peso 460
diciembre December 407
público public (population) 386
servicio service 357
accionar action 349
desarrollar develop 338
personar people 298
federal federal (government) 283
educación education 279
salud health 278
realizar perform 274
país country 274
social social (programs) 266
atención aid 256
apoyar support 248

Mentions per State

If you are on Windows you will require to use Anaconda which is a Python distribution that comes bundled with hundreds of data science oriented libraries.

After you install Anaconda you will require to install 2 packages, you can do this by running these commands on your CMD or Terminal.

conda install -c conda-forge geopandas
conda install -c conda-forge descartes

Don't try to use pip install geopandas on Windows, it won't work. It will work well on Linux and macOS though.

After you have geopandas installed you will require a shape file of Mexico, you can download it from the following link:

https://www.arcgis.com/home/item.html?id=ac9041c51b5c49c683fbfec61dc03ba8

You will require to unzip the file and rename the folder to mexicostates. This folder contains various files, some with metadata and others with coordinates.

We start by loading the entities csv file and loading the shape file.

df = pd.read_csv("./data/entities.csv")
mexico_df = geopandas.read_file("./mexicostates")

The shape file is loaded as a folder instead of an specific file.

Now we will iterate over all our states, but there are some details. The shape file contains the state names without accent marks and uses the old name of Ciudad de México (it was named Distrito Federal).

Before modifying the shape file DataFrame we 'clean' the state name so they can be matched.

for state in STATES:

    # We remove accent marks and rename Ciudad de Mexico to its former name.
    clean_name = clean_word(state)

    if clean_name == "Ciudad de Mexico":
        clean_name = "Distrito Federal"
    elif clean_name == "Estado de Mexico":
        clean_name = "Mexico"

    # We insert the count value into the row with the matching ADMIN_NAME (state name).
    mexico_df.loc[mexico_df["ADMIN_NAME"] == clean_name,
                    "count"] = len(df[df["text_lower"] == state.lower()])

The shape file DataFrame now has a new column with the counts of each state.

We use the plot method of the DataFrame and specify the column name, the color map and enable the legend.

mexico_df.plot(column="count", cmap="plasma", legend=True)

We add the final customizations.

plt.title("Mentions by State")
plt.axis("off")
plt.tight_layout()
plt.show()

States Count

Sentiment Analysis

For the sentiment analysis I used the lexicon-based method, which consists in counting positive and negative words per sentence.

This method has a few considerations, the most important is that it doesn't know the context of the sentence, so sentences like Reduced the crime levels by 10%. will be considered as negative.

The only way around this is to train a model and use Machine Learning to evaluate the sentences with it.

Unfortunately, this is a time-consuming task and we will have to use the other method.

The first thing to do is to load the sentences csv file and only take into account scores between -10 and 10. There were very few cases where the score goes beyond those boundaries.

df = pd.read_csv("./data/sentences.csv")
df = df[(df["score"] <= 10) & (df["score"] >= -10)]

We will make bars with a score below zero yellow and bars with a score above zero blue, to do this we first create an Array with the same length as our Dataframe.

Each element of the Array is a tuple of 3 decimal values representing an RGB color.

colors = np.array([(0.811, 0.913, 0.145)]*len(df["score"]))

Now we use a boolean mask and set different RGB values where the score is equal or higher than zero.

colors[df["score"] >= 0] = (0.529, 0.870, 0.972)

Now we use a bar plot and pass the x, y and the previously calculated colors.

plt.bar(df.index, df["score"], color=colors, linewidth=0)

We set our yticks in steps of 2, from -10 to 10.

yticks_labels = ["{}".format(int(i)) for i in np.arange(-12, 12, 2)]
plt.yticks(np.arange(-12, 12, 2), yticks_labels)

We add the final customizations.

plt.xlabel("Sentence Number")
plt.ylabel("Score")
plt.title("Sentiment Analysis")
plt.show()

Sentiment Analysis

Conclusion

In previous projects, I have used some NLP workflows, such as tokenizing words and sentences, but this time I wanted to do something a bit more complex.

I learned several new things and their best practices. This new knowledge will come in handy for future projects.

The next steps are to build a Machine Learning model for sentiment analysis and evaluate next year's report with it.

Become a Patron!

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].