
mozilla / python_mozetl

License: MIT
ETL jobs for Firefox Telemetry

Programming Languages

Python
139335 projects - #7 most used programming language
Shell
77523 projects
Jupyter Notebook
11667 projects

Projects that are alternatives of or similar to python_mozetl

sparklanes
A lightweight data processing framework for Apache Spark
Stars: ✭ 17 (-32%)
Mutual labels:  etl, pyspark
Pyspark Example Project
Example project implementing best practices for PySpark ETL jobs and applications.
Stars: ✭ 633 (+2432%)
Mutual labels:  etl, pyspark
lineage
Generate beautiful documentation for your data pipelines in markdown format
Stars: ✭ 16 (-36%)
Mutual labels:  etl, pyspark
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (+56%)
Mutual labels:  etl, pyspark
basin
Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser
Stars: ✭ 25 (+0%)
Mutual labels:  etl, pyspark
Butterfree
A tool for building feature stores.
Stars: ✭ 126 (+404%)
Mutual labels:  etl, pyspark
flask-spark-docker
Just a boilerplate for PySpark and Flask
Stars: ✭ 32 (+28%)
Mutual labels:  pyspark
pyspark-ML-in-Colab
Pyspark in Google Colab: A simple machine learning (Linear Regression) model
Stars: ✭ 32 (+28%)
Mutual labels:  pyspark
zdh web
Big data collection and extraction platform
Stars: ✭ 292 (+1068%)
Mutual labels:  etl
django-data-migration
Data migration framework for Django that migrates legacy data into your new django app
Stars: ✭ 18 (-28%)
Mutual labels:  etl
uptasticsearch
An Elasticsearch client tailored to data science workflows.
Stars: ✭ 47 (+88%)
Mutual labels:  etl
ceja
PySpark phonetic and string matching algorithms
Stars: ✭ 24 (-4%)
Mutual labels:  pyspark
Sparkora
Powerful rapid automatic EDA and feature engineering library with a very easy to use API 🌟
Stars: ✭ 51 (+104%)
Mutual labels:  pyspark
proc-that
proc(ess)-that - easy extendable ETL tool for Node.js. Written in TypeScript.
Stars: ✭ 25 (+0%)
Mutual labels:  etl
starlake
Starlake is a Spark Based On Premise and Cloud ELT/ETL Framework for Batch & Stream Processing
Stars: ✭ 16 (-36%)
Mutual labels:  etl
zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
Stars: ✭ 655 (+2520%)
Mutual labels:  etl
flock
Flock: A Low-Cost Streaming Query Engine on FaaS Platforms
Stars: ✭ 232 (+828%)
Mutual labels:  etl
csv-cruncher
Treats CSV and JSON files as SQL tables, and exports SQL SELECTs back to CSV or JSON.
Stars: ✭ 32 (+28%)
Mutual labels:  etl
polygon-etl
ETL (extract, transform and load) tools for ingesting Polygon blockchain data to Google BigQuery and Pub/Sub
Stars: ✭ 53 (+112%)
Mutual labels:  etl
sync-engine-example
Synchronization Algorithm Exploration: Techniques to synchronize a SQL database with external destinations.
Stars: ✭ 17 (-32%)
Mutual labels:  etl

Firefox Telemetry Python ETL


This repository is a collection of ETL jobs for Firefox Telemetry.

Benefits

Jobs committed to python_mozetl can be scheduled via Airflow or ATMO. We provide a test suite and code review, which make your job more maintainable. Centralizing our jobs in one repository allows for code reuse and easier collaboration.

There are a host of benefits to moving your analysis out of a Jupyter notebook and into a Python package. For more on this, see the writeup at cookiecutter-python-etl.

Tests

Dependencies

First, install the necessary runtime dependencies: snappy and the Java Runtime Environment. These are required by the pyspark package. On Ubuntu:

$ sudo apt-get install libsnappy-dev openjdk-8-jre-headless
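
To sanity-check the installation, a quick local smoke test looks roughly like the following (a sketch only; it assumes the pyspark and python-snappy packages are already installed in your Python environment):

# Smoke test for the runtime dependencies: python-snappy should round-trip a
# payload, and pyspark should be able to start a local SparkSession.
import snappy
from pyspark.sql import SparkSession

assert snappy.decompress(snappy.compress(b"telemetry")) == b"telemetry"

spark = SparkSession.builder.master("local[1]").getOrCreate()
print(spark.range(5).count())  # should print 5
spark.stop()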

Calling the test runner

Run tests by calling tox in the root directory.

Arguments to pytest can be passed through tox using --.

tox -- -k test_main.py # runs tests only in the test_main module

Tests are configured in tox.ini.
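
As a rough illustration of what a job's test module can look like (a sketch only; the fixture setup and names below are illustrative, not copied from this repository):

# test_example_job.py -- a hypothetical mozetl-style test module.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # A small local SparkSession is enough for unit-testing transformations.
    session = (
        SparkSession.builder.master("local[1]")
        .appName("mozetl-tests")
        .getOrCreate()
    )
    yield session
    session.stop()

def test_transform_keeps_all_rows(spark):
    df = spark.createDataFrame(
        [("client-a", 1), ("client-b", 2)], ["client_id", "count"]
    )
    # In a real test you would call your job's transform function here,
    # e.g. result = my_job.transform(df), and assert on its schema and rows.
    assert df.count() == 2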

Manual Execution

ATMO

The first method of manual execution is the mozetl-submit.sh script located in bin. This script is used with the EMRSparkOperator in telemetry-airflow to schedule execution of mozetl jobs. It may be used with ATMO to manually test jobs.

In an SSH session with an ATMO cluster, grab a copy of the script:

$ wget https://raw.githubusercontent.com/mozilla/python_mozetl/main/bin/mozetl-submit.sh

Push your code to your own fork, making sure the job has been registered in mozetl.cli, then run the script:

$ ./mozetl-submit.sh \
    -p https://github.com/<USERNAME>/python_mozetl.git \
    -b <BRANCHNAME> \
    <COMMAND> \
        --first-argument foo \
        --second-argument bar

See comments in bin/mozetl-submit.sh for more details.
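
For context, jobs are exposed as click commands; registering a new job in mozetl.cli looks roughly like the sketch below (the group, command, and module names are illustrative, not the repository's exact code):

# Rough sketch of how a job is wired into mozetl.cli (names are illustrative).
import click

@click.group()
def entry_point():
    """Commands for Firefox Telemetry ETL jobs."""

@entry_point.command()
@click.option("--first-argument", required=True)
@click.option("--second-argument", required=True)
def example_job(first_argument, second_argument):
    # Import lazily so the CLI stays fast and pyspark is only loaded on demand.
    from mozetl.example import job  # hypothetical module
    job.run(first_argument, second_argument)

if __name__ == "__main__":
    entry_point()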

Databricks

Jobs may also be executed on Databricks. They are scheduled via the MozDatabricksSubmitRunOperator in telemetry-airflow.

This script runs on your local machine and submits the job to a remote Spark executor. First, generate an API token on the User Settings page in Databricks, then run the script:

python bin/mozetl-databricks.py \
    --git-path https://github.com/<USERNAME>/python_mozetl.git \
    --git-branch <BRANCHNAME> \
    --token <TOKEN>  \
    <COMMAND> \
        --first-argument foo \
        --second-argument bar

Run python bin/mozetl-databricks.py --help for more options, including increasing the number of workers and using Python 3. Refer to this pull request for more examples.

It is also possible to use this script for external mozetl-compatible modules by setting the --git-path and --module-name options appropriately. See this pull request for more information about building a mozetl-compatible repository that can be scheduled on Databricks.

Scheduling

You can schedule your job on either ATMO or Airflow.

Scheduling a job on ATMO is easy and does not require review, but is less maintainable. Use ATMO to schedule jobs you are still prototyping or jobs that have a limited lifespan.

Jobs scheduled on Airflow are more robust:

  • Airflow will automatically retry your job in the event of a failure.
  • You can also alert other members of your team when jobs fail, while ATMO will only send an email to the job owner.
  • If your job depends on other datasets, you can identify these dependencies in Airflow. This is useful if an upstream job fails.

ATMO

To schedule a job on ATMO, take a look at the load_and_run notebook. This notebook clones and installs the python_mozetl package. You can then run your job from the notebook.
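
The gist of that notebook is roughly the following (a sketch, not the notebook's actual contents; the fork URL, branch, and job module are placeholders):

# Install your fork of python_mozetl into the cluster's Python environment...
import subprocess, sys

subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "git+https://github.com/<USERNAME>/python_mozetl.git@<BRANCHNAME>",
])

# ...then import and run the job's entry function directly.
# from mozetl.example import job   # hypothetical module
# job.run("foo", "bar")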

Airflow

To schedule a job on Airflow, you'll need to add a new Operator to the DAGs and provide a shell script for running your job. Take a look at this example shell script and this example Operator for templates.
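
For orientation only, the overall shape of such a DAG is sketched below with a generic BashOperator; the real telemetry-airflow DAGs use the project-specific operators mentioned above (EMRSparkOperator / MozDatabricksSubmitRunOperator), and every name and path here is a placeholder:

# Illustrative only: a generic Airflow DAG that runs a job shell script daily.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator  # import path differs in Airflow 1.x

default_args = {
    "owner": "your-team",
    "retries": 2,                        # Airflow retries the job on failure
    "retry_delay": timedelta(minutes=30),
    "email_on_failure": True,
    "email": ["your-team@example.org"],
}

with DAG(
    "example_mozetl_job",
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
) as dag:
    run_job = BashOperator(
        task_id="run_example_job",
        bash_command="bash /path/to/example_job.sh ",  # trailing space keeps Jinja from treating the .sh path as a template file
    )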

Early Stage ETL Jobs

We usually require tests before accepting new ETL jobs. If you're still prototyping your job but would like to move your code out of a Jupyter notebook, take a look at cookiecutter-python-etl.

This tool will initialize a new repository with all of the necessary boilerplate for testing and packaging. In fact, this project was created with cookiecutter-python-etl.
