
okfn-brasil / Querido Diario

Licence: MIT
📰 Brazilian government gazettes, accessible to everyone.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Querido Diario

Serenata De Amor
🕵 Artificial Intelligence for social control of public administration
Stars: ✭ 4,367 (+541.26%)
Mutual labels:  artificial-intelligence, hacktoberfest, open-data, data-science, civic-tech, politics
Free Ai Resources
🚀 FREE AI Resources - 🎓 Courses, 👷 Jobs, 📝 Blogs, 🔬 AI Research, and many more - for everyone!
Stars: ✭ 192 (-71.81%)
Mutual labels:  artificial-intelligence, hacktoberfest, data-science
Nosdeputes.fr
Repository of NosDéputés.fr: the French parliamentary monitoring website
Stars: ✭ 69 (-89.87%)
Mutual labels:  open-data, civic-tech, politics
cia
Citizen Intelligence Agency, open-source intelligence (OSINT) project
Stars: ✭ 79 (-88.4%)
Mutual labels:  politics, open-data, civic-tech
Datascience Ai Machinelearning Resources
Alex Castrounis' curated set of resources for artificial intelligence (AI), machine learning, data science, internet of things (IoT), and more.
Stars: ✭ 414 (-39.21%)
Mutual labels:  artificial-intelligence, data-science
Gorgonia
Gorgonia is a library that helps facilitate machine learning in Go.
Stars: ✭ 4,295 (+530.69%)
Mutual labels:  artificial-intelligence, hacktoberfest
Spacy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Stars: ✭ 21,978 (+3127.31%)
Mutual labels:  artificial-intelligence, data-science
Pba
Efficient Learning of Augmentation Policy Schedules
Stars: ✭ 461 (-32.31%)
Mutual labels:  artificial-intelligence, data-science
Mindsdb
Predictive AI layer for existing databases.
Stars: ✭ 4,199 (+516.59%)
Mutual labels:  artificial-intelligence, hacktoberfest
Caer
High-performance Vision library in Python. Scale your research, not boilerplate.
Stars: ✭ 452 (-33.63%)
Mutual labels:  artificial-intelligence, data-science
Learn Data Science For Free
This repository combines valuable Data Science resources scattered across the internet into a single, sequentially structured collection, to help every beginner searching for free and structured learning material. For constant updates follow me in …
Stars: ✭ 4,757 (+598.53%)
Mutual labels:  artificial-intelligence, data-science
Data Science
Collection of useful data science topics along with code and articles
Stars: ✭ 315 (-53.74%)
Mutual labels:  artificial-intelligence, data-science
Wptools
Wikipedia tools (for Humans): easily extract data from Wikipedia, Wikidata, and other MediaWikis
Stars: ✭ 371 (-45.52%)
Mutual labels:  open-data, data-science
Rosie
🤖 Python application responsible for Serenata de Amor's intelligence
Stars: ✭ 420 (-38.33%)
Mutual labels:  artificial-intelligence, data-science
5calls
Frontend for the 5calls.org site
Stars: ✭ 369 (-45.81%)
Mutual labels:  civic-tech, politics
Mycroft Core
Mycroft Core, the Mycroft Artificial Intelligence platform.
Stars: ✭ 5,489 (+706.02%)
Mutual labels:  artificial-intelligence, hacktoberfest
Unity Sdk
🎮 Unity SDK to use the IBM Watson services.
Stars: ✭ 546 (-19.82%)
Mutual labels:  artificial-intelligence, hacktoberfest
Sentinelsat
Search and download Copernicus Sentinel satellite images
Stars: ✭ 576 (-15.42%)
Mutual labels:  hacktoberfest, open-data
Hyperparameter hunter
Easy hyperparameter optimization and automatic result saving across machine learning algorithms and libraries
Stars: ✭ 648 (-4.85%)
Mutual labels:  artificial-intelligence, data-science
Csinva.github.io
Slides, paper notes, class notes, blog posts, and research on ML 📉, statistics 📊, and AI 🤖.
Stars: ✭ 342 (-49.78%)
Mutual labels:  artificial-intelligence, data-science

Diário Oficial

Diário Oficial is the Brazilian government gazette, one of the best places to learn about the latest actions of the public administration, with distinct publications at the federal, state, and municipal levels.

Even with recurrent efforts to enforce the Freedom of Information legislation across the country, official communication in most territories is still published only as PDFs.

The goal of this project is to upgrade Diário Oficial to the digital age, centralizing information currently only available through separate sources.

When this project was initially released, it had two distinct goals: creating crawlers for government gazettes and parsing bidding exemptions from them. Going forward, it is limited to the first objective.

Development environment

The best way to understand how Querido Diário works is to get the source and run it locally. All crawlers are developed with the Scrapy framework, which provides a tutorial so you can learn how to use it.

If you are on a Windows computer, you will need the Microsoft Visual C++ Build Tools before running the steps below. During installation, select 'C++ build tools' on the Workloads tab, and 'Windows 10 SDK' and 'MSVC v142 - VS 2019 C++ x64/x86 build tools' on the Individual Components tab.

If you are in a Linux-like environment, the following commands will create a new virtual environment (keeping everything isolated from your system), activate it, and install all the libraries needed to run and develop new spiders.

$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r data_collection/requirements.txt
$ pre-commit install

On a Windows computer, you can use the commands above; just replace source .venv/bin/activate with .venv/Scripts/activate.bat. The rest is the same as on Linux.

Run Gazette Crawler

After configuring your environment, you will be able to run and develop new spiders. The Scrapy project lives in the data_collection directory, so you must enter it before running the spiders and the scrapy command:

$ cd data_collection

Some helpful commands are listed below.

Get the list of all available spiders:

$ scrapy list

Run the spider named spider_name:

$ scrapy crawl spider_name

You can limit which gazettes are downloaded by passing start_date as an argument in YYYY-MM-DD format. The following command downloads only gazettes dated from September 1, 2020 onward:

$ scrapy crawl sc_florianopolis -a start_date=2020-09-01
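Arguments passed with -a reach the spider's constructor as strings. A minimal sketch of how a base spider might turn start_date into a date object (the class here is hypothetical; the real base class in gazette/spiders/base may differ):

```python
from datetime import datetime


class GazetteSpiderSketch:
    """Illustrative only: shows how -a start_date=YYYY-MM-DD could be handled."""

    def __init__(self, start_date=None):
        if start_date is not None:
            # Scrapy passes every -a argument as a string
            self.start_date = datetime.strptime(start_date, "%Y-%m-%d").date()


spider = GazetteSpiderSketch(start_date="2020-09-01")
print(spider.start_date)  # 2020-09-01
```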

Generate multiple spiders from template

You may end up in a situation where different cities use the same spider base, such as FecamGazetteSpider. To avoid creating the spider files manually, you can use a script when you have several simple spiders sharing the same base.

The spider template lives in the scripts/ folder. Here is an example of a generated spider:

from datetime import date
from gazette.spiders.base import ImprensaOficialSpider


class BaGentioDoOuroSpider(ImprensaOficialSpider):

    name = "ba_gentio_do_ouro"
    allowed_domains = ["pmGENTIODOOUROBA.imprensaoficial.org"]
    start_date = date(2017, 2, 1)
    url_base = "http://pmGENTIODOOUROBA.imprensaoficial.org"
    TERRITORY_ID = "2911303"

To run the script, you only need a CSV file following the structure below:

url,city,state,territory_id,start_day,start_month,start_year,base_class
http://pmXIQUEXIQUEBA.imprensaoficial.org,Xique-Xique,BA,2933604,1,1,2017,ImprensaOficialSpider
http://pmWENCESLAUGUIMARAESBA.imprensaoficial.org,Wenceslau Guimarães,BA,2933505,1,1,2017,ImprensaOficialSpider
http://pmVERACRUZBA.imprensaoficial.org,Vera Cruz,BA,2933208,1,4,2017,ImprensaOficialSpider

Once you have the CSV file, run the command:

$ cd scripts/
$ python generate_spiders.py new-spiders.csv

That's it. The new spiders will be in the directory data_collection/gazette/spiders/.
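Conceptually, the generation script reads each CSV row and renders the spider template with the row's fields. A minimal sketch of that idea (the helper names below are hypothetical; the real scripts/generate_spiders.py may differ):

```python
import csv
import io
import unicodedata

# Simplified version of the template in scripts/
TEMPLATE = '''from datetime import date
from gazette.spiders.base import {base_class}


class {class_name}({base_class}):
    name = "{name}"
    allowed_domains = ["{domain}"]
    start_date = date({year}, {month}, {day})
    url_base = "{url}"
    TERRITORY_ID = "{territory_id}"
'''


def spider_name(city, state):
    # "Xique-Xique", "BA" -> "ba_xique_xique"
    ascii_city = unicodedata.normalize("NFKD", city).encode("ascii", "ignore").decode()
    return state.lower() + "_" + ascii_city.lower().replace("-", " ").replace(" ", "_")


def render_spider(row):
    name = spider_name(row["city"], row["state"])
    class_name = "".join(part.title() for part in name.split("_")) + "Spider"
    return name, TEMPLATE.format(
        base_class=row["base_class"],
        class_name=class_name,
        name=name,
        domain=row["url"].split("//")[1],
        year=row["start_year"],
        month=row["start_month"],
        day=row["start_day"],
        url=row["url"],
        territory_id=row["territory_id"],
    )


csv_text = """url,city,state,territory_id,start_day,start_month,start_year,base_class
http://pmXIQUEXIQUEBA.imprensaoficial.org,Xique-Xique,BA,2933604,1,1,2017,ImprensaOficialSpider"""

for row in csv.DictReader(io.StringIO(csv_text)):
    name, source = render_spider(row)
    print(name)  # ba_xique_xique
```

The real script also writes each rendered spider to data_collection/gazette/spiders/, which this sketch omits.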

Troubleshooting

Python.h missing

While running the pip install command, you may get an error like the one below:

module.c:1:10: fatal error: Python.h: No such file or directory
     #include <Python.h>
              ^~~~~~~~~~
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Try installing python3-dev, e.g. via apt install python3-dev if you are using a Debian-like distro, or use your distro's package manager. Make sure you install the package matching your Python version (e.g. python3.6-dev or python3.7-dev); you can check your version with python3 --version.
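For example, on a Debian-like system (package names vary by distro and Python minor version, so treat the names below as a guide, not a prescription):

```shell
# Find your Python minor version first
python3 --version          # e.g. "Python 3.8.10" -> install python3.8-dev

# Debian/Ubuntu (pick the package matching your version):
#   sudo apt install python3-dev
# Fedora:
#   sudo dnf install python3-devel
```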

Contributing

If you are interested in fixing issues and contributing directly to the code base, please see the document CONTRIBUTING.md.

Acknowledgments

This project is maintained by Open Knowledge Foundation Brasil, thanks to the support of Digital Ocean and hundreds of other supporters.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].