catalyst-cooperative / Pudl

Licence: mit
The Public Utility Data Liberation Project

===============================================================================
The Public Utility Data Liberation Project (PUDL)
===============================================================================

.. readme-intro

.. image:: https://www.repostatus.org/badges/latest/active.svg
   :target: https://www.repostatus.org/#active
   :alt: Project Status: Active – The project has reached a stable, usable state and is being actively developed.

.. image:: https://github.com/catalyst-cooperative/pudl/workflows/tox-pytest/badge.svg
   :target: https://github.com/catalyst-cooperative/pudl/actions?query=workflow%3Atox-pytest
   :alt: Tox-PyTest Status

.. image:: https://img.shields.io/readthedocs/catalystcoop-pudl
   :target: https://catalystcoop-pudl.readthedocs.io/en/latest/
   :alt: Read the Docs Build Status

.. image:: https://img.shields.io/codecov/c/github/catalyst-cooperative/pudl
   :target: https://codecov.io/gh/catalyst-cooperative/pudl
   :alt: Codecov Test Coverage

.. image:: https://img.shields.io/codacy/grade/2fead07adef249c08288d0bafae7cbb5
   :target: https://app.codacy.com/app/zaneselvans/pudl
   :alt: Codacy Grade

.. image:: https://img.shields.io/pypi/v/catalystcoop.pudl
   :target: https://pypi.org/project/catalystcoop.pudl/
   :alt: PyPI Latest Version

.. image:: https://img.shields.io/pypi/pyversions/catalystcoop.pudl
   :target: https://pypi.org/project/catalystcoop.pudl/
   :alt: PyPI - Supported Python Versions

.. image:: https://img.shields.io/conda/vn/conda-forge/catalystcoop.pudl
   :target: https://anaconda.org/conda-forge/catalystcoop.pudl
   :alt: conda-forge Version

.. image:: https://zenodo.org/badge/80646423.svg
   :target: https://zenodo.org/badge/latestdoi/80646423
   :alt: Zenodo DOI

`PUDL <https://catalyst.coop/pudl/>`__ makes US energy data easier to access and use. Hundreds of gigabytes of information are available from government agencies, but the data is often difficult to work with, and different sources can be hard to combine. PUDL takes the original spreadsheets, CSV files, and databases and turns them into unified `tabular data packages <https://specs.frictionlessdata.io/tabular-data-package/>`__ that can be used to populate a database, or read in directly with Python, R, Microsoft Access, and many other tools.

The project currently integrates data from:

  • `EIA Form 860 <https://www.eia.gov/electricity/data/eia860/>`__
  • `EIA Form 861 <https://www.eia.gov/electricity/data/eia861/>`__
  • `EIA Form 923 <https://www.eia.gov/electricity/data/eia923/>`__
  • `The EPA Continuous Emissions Monitoring System (CEMS) <https://ampd.epa.gov/ampd/>`__
  • `FERC Form 1 <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-1-electric-utility-annual>`__
  • `FERC Form 714 <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-no-714-annual-electric/data>`__
  • `The US Census Demographic Profile 1 Geodatabase <https://www.census.gov/geographies/mapping-files/2010/geo/tiger-data.html>`__

The project is focused on serving researchers, activists, journalists, and policy makers who might not otherwise be able to afford access to this data from existing commercial data providers. You can sign up for PUDL email updates `here <https://catalyst.coop/updates/>`__.

Quick Start
-----------

Install `Anaconda <https://www.anaconda.com/distribution/>`__ or `miniconda <https://docs.conda.io/en/latest/miniconda.html>`__ (see `this detailed setup guide <https://www.mrdbourke.com/get-your-computer-ready-for-machine-learning-using-anaconda-miniconda-and-conda/>`__ if you need help) and then work through the following commands.

Create and activate a conda environment named ``pudl`` that installs packages from the community-maintained ``conda-forge`` channel. In addition to the ``catalystcoop.pudl`` package, install JupyterLab so we can work with the PUDL data interactively.

.. code-block:: console

    $ conda create --yes --name pudl --channel conda-forge \
        --strict-channel-priority python=3.8 \
        catalystcoop.pudl jupyter jupyterlab pip
    $ conda activate pudl

Now create a data management workspace called ``pudl-work``. The workspace has a well-defined directory structure that PUDL uses to organize the data it downloads, processes, and outputs. Run ``pudl_setup --help`` for details.

.. code-block:: console

    $ mkdir pudl-work
    $ pudl_setup pudl-work
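
As a quick sanity check after setup, the sketch below reports which workspace subdirectories are missing. It only knows about the four directory names that appear in this README (``settings``, ``datapkg``, ``parquet``, ``notebook``). The real layout may contain more, and some of these may only appear after later steps, so treat the list and the helper name ``missing_workspace_dirs`` as assumptions of this example rather than part of PUDL.

```python
from pathlib import Path

# Only the subdirectories this README mentions; the real workspace may hold
# more, and some may be created by later steps rather than by pudl_setup.
EXPECTED_DIRS = ("settings", "datapkg", "parquet", "notebook")


def missing_workspace_dirs(workspace):
    """Return the expected subdirectory names that do not exist under workspace."""
    root = Path(workspace)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

Calling ``missing_workspace_dirs("pudl-work")`` returns an empty list once every directory named above is present.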

With the workspace set up, we can run the PUDL ETL (Extract, Transform, Load) pipeline to clean up the raw data and integrate it. There are several steps:

  • Cloning the FERC Form 1 database into SQLite
  • Extracting data from that database and other sources and cleaning it up
  • Outputting the clean data into CSV/JSON based data packages, and finally
  • Loading the data packages into a local database or other storage medium.

PUDL provides a script to clone the FERC Form 1 database. The script is called ``ferc1_to_sqlite`` and it is controlled by a YAML file. An example can be found in the ``settings`` folder:

.. code-block:: console

    $ ferc1_to_sqlite pudl-work/settings/ferc1_to_sqlite_example.yml

The main ETL process is controlled by another YAML file that defines the data to be processed. A well-commented ``etl_example.yml`` can also be found in the ``settings`` directory of the PUDL workspace you set up. The script that runs the ETL process is called ``pudl_etl``:

.. code-block:: console

    $ pudl_etl pudl-work/settings/etl_example.yml

This generates a bundle of tabular data packages in ``pudl-work/datapkg/pudl-example``.
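
If you would rather drive these two steps from Python than type them by hand, a minimal sketch along the following lines may help. It simply shells out to the same commands shown above. The names ``ETL_STEPS`` and ``run_steps`` are made up for this example, and the paths assume the example workspace.

```python
import subprocess

# Commands copied from the steps above; paths assume the example workspace.
ETL_STEPS = [
    ["ferc1_to_sqlite", "pudl-work/settings/ferc1_to_sqlite_example.yml"],
    ["pudl_etl", "pudl-work/settings/etl_example.yml"],
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order, stopping at the first failure.

    ``runner`` is injectable so the function can be exercised without
    actually invoking the PUDL scripts.
    """
    for cmd in steps:
        # check=True raises CalledProcessError on a non-zero exit status,
        # so later steps never run on top of a failed one.
        runner(cmd, check=True)
```

With the ``pudl`` conda environment active, ``run_steps(ETL_STEPS)`` would run both scripts in sequence.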

Tabular data packages are made up of CSV and JSON files. They're relatively easy to parse programmatically, and readable by humans. They are also well suited to archiving, citation, and bulk distribution, but they are static.
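
To illustrate how little code that parsing takes, here is a minimal sketch using only the Python standard library. It relies on the general tabular data package layout: a ``datapackage.json`` descriptor whose ``resources`` entries each have a ``name`` and a ``path`` relative to the descriptor. The helper names ``list_resources`` and ``read_rows`` are hypothetical, and the sketch assumes each ``path`` is a single string.

```python
import csv
import json
from pathlib import Path


def list_resources(descriptor_path):
    """Map each resource name to the path of its CSV file."""
    descriptor_path = Path(descriptor_path)
    with descriptor_path.open(encoding="utf-8") as f:
        descriptor = json.load(f)
    # Resource paths are relative to the directory holding datapackage.json.
    # This sketch assumes each "path" is a single string, not a list.
    return {
        res["name"]: descriptor_path.parent / res["path"]
        for res in descriptor["resources"]
    }


def read_rows(csv_path):
    """Read a CSV resource into a list of dicts keyed by column name."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

All values come back as strings here. A fuller reader would use the type information in each resource's schema to convert them.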

To make the data easier to query and work with interactively, we typically load it into a local SQLite database. The following script first combines several data packages from the same bundle into a single data package and then loads it:

.. code-block:: console

    $ datapkg_to_sqlite \
        pudl-work/datapkg/pudl-example/ferc1-example/datapackage.json \
        pudl-work/datapkg/pudl-example/eia-example/datapackage.json

The EPA CEMS data is roughly 100 times larger than all of the other data we have integrated so far, and loading it into SQLite takes a very long time. We've found that the most convenient way to work with it is as `Apache Parquet <https://parquet.apache.org>`__ files, so we provide a script that converts the EPA CEMS Hourly table from the generated data package into that format. To convert the example EPA CEMS data package, run:

.. code-block:: console

    $ epacems_to_parquet pudl-work/datapkg/pudl-example/epacems-eia-example/datapackage.json

The resulting Apache Parquet dataset will be stored in ``pudl-work/parquet/epacems`` and will be partitioned by year and by state, so that you can read in only the relevant portions of the dataset. In the example, though, you'll only find 2019 data for Idaho.
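
The sketch below shows one way to pick out only the partitions you care about. It assumes a Hive-style directory layout (``year=<YYYY>/state=<XX>/`` beneath the dataset root), which this README does not spell out, so check the directory names in your own workspace first. The helper ``partition_dirs`` is hypothetical.

```python
from pathlib import Path


def partition_dirs(root, years=None, states=None):
    """Return partition directories under root matching the given filters.

    Assumes a Hive-style layout, root/year=<YYYY>/state=<XX>/, which is an
    assumption of this sketch rather than something the README specifies.
    A filter of None matches every value.
    """
    matches = []
    for path in sorted(Path(root).glob("year=*/state=*")):
        if not path.is_dir():
            continue
        # Directory names look like "year=2019" and "state=ID".
        year = int(path.parent.name.split("=", 1)[1])
        state = path.name.split("=", 1)[1]
        if years is not None and year not in years:
            continue
        if states is not None and state not in states:
            continue
        matches.append(path)
    return matches
```

For the example data, ``partition_dirs("pudl-work/parquet/epacems", years={2019}, states={"ID"})`` would list the matching directories, which can then be handed to a Parquet reader such as ``pandas.read_parquet``.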

Now that you have a live database, you can work with it using a variety of tools, including Python, pandas dataframes, and `Jupyter Notebooks <https://jupyter.org>`__. The following command starts a local Jupyter notebook server and opens a notebook containing some simple PUDL usage examples. The notebook is distributed with the Python package and deployed into your workspace:

.. code-block:: console

    $ jupyter lab pudl-work/notebook/pudl_intro.ipynb
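
Outside the notebook, the SQLite database can also be inspected with nothing but the Python standard library. The sketch below lists every table with its row count. The helper name ``table_row_counts`` is made up for this example, and this README does not say where the database file is written, so any path you pass in (for instance ``pudl-work/sqlite/pudl.sqlite``) is an assumption to verify in your workspace.

```python
import sqlite3


def table_row_counts(db_path):
    """Return {table_name: row_count} for every table in a SQLite database."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Table names come from the database itself; quote them defensively.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```

Note that ``sqlite3.connect`` creates an empty database if the file does not exist, so an empty result usually means the path is wrong.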

For more usage and installation details, see our `in-depth documentation <https://catalystcoop-pudl.readthedocs.io/>`__ on Read The Docs.

Contributing to PUDL
--------------------

Find PUDL useful? Want to help make it better? There are lots of ways to contribute!

  • Please be sure to read our `Code of Conduct <https://catalystcoop-pudl.readthedocs.io/en/latest/code_of_conduct.html>`__.
  • You can file a bug report, make a feature request, or ask questions in the `GitHub issue tracker <https://github.com/catalyst-cooperative/pudl/issues>`__.
  • Feel free to fork the project and make a pull request with new code, better documentation, or example notebooks.
  • `Make a recurring financial contribution <https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=PZBZDFNKBJW5E&source=url>`__ to support our work liberating public energy data.
  • `Hire us to do some custom analysis <https://catalyst.coop/hire-catalyst/>`__ and allow us to integrate the resulting code into PUDL.
  • For more information, check out our `Contribution Guidelines <https://catalystcoop-pudl.readthedocs.io/en/latest/CONTRIBUTING.html>`__.

Licensing
---------

The PUDL software is released under the `MIT License <https://opensource.org/licenses/MIT>`__. The `PUDL documentation <https://catalystcoop-pudl.readthedocs.io>`__ and the data packages we distribute are released under the `CC-BY-4.0 <https://creativecommons.org/licenses/by/4.0/>`__ license.

Contact Us
----------

For help with initial setup, usage questions, bug reports, suggestions to make PUDL better, and anything else that could be of use or interest to the broader community of users, use the `PUDL issue tracker <https://github.com/catalyst-cooperative/pudl/issues>`__ on GitHub. For private communication about the project, you can email the team at [email protected].

About Catalyst Cooperative
--------------------------

`Catalyst Cooperative <https://catalyst.coop>`__ is a small group of data scientists and policy wonks. We’re organized as a worker-owned cooperative consultancy. Our goal is a more just, livable, and sustainable world. We integrate public data and perform custom analyses to inform public policy. Our focus is primarily on mitigating climate change and improving electric utility regulation in the United States.

Do you work on renewable energy or climate policy? Have you found yourself scraping data from government PDFs, spreadsheets, websites, and databases, without getting something reusable? We build tools to pull this kind of information together reliably and automatically so you can focus on your real work instead — whether that’s political advocacy, energy journalism, academic research, or public policymaking.
