m-lab / etl

License: Apache-2.0
M-Lab ingestion pipeline

Programming Languages

go — #10 most used programming language (31,211 projects)
shell — 77,523 projects
javascript — #8 most used programming language (184,084 projects)

Projects that are alternatives of or similar to etl

Go Streams
A lightweight stream processing library for Go
Stars: ✭ 615 (+4000%)
Mutual labels:  pipeline, etl
Mara Pipelines
A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
Stars: ✭ 1,841 (+12173.33%)
Mutual labels:  pipeline, etl
Phila Airflow
Stars: ✭ 16 (+6.67%)
Mutual labels:  pipeline, etl
Stetl
Stetl, Streaming ETL, is a lightweight geospatial processing and ETL framework written in Python.
Stars: ✭ 64 (+326.67%)
Mutual labels:  pipeline, etl
naas
⚙️ Schedule notebooks, run them like APIs, expose securely your assets: Jupyter as a viable ⚡️ Production environment
Stars: ✭ 219 (+1360%)
Mutual labels:  pipeline, etl
Datavec
ETL Library for Machine Learning - data pipelines, data munging and wrangling
Stars: ✭ 272 (+1713.33%)
Mutual labels:  pipeline, etl
Setl
A simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (+426.67%)
Mutual labels:  pipeline, etl
basin
Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser
Stars: ✭ 25 (+66.67%)
Mutual labels:  pipeline, etl
Bulk Writer
Provides guidance for fast ETL jobs, an IDataReader implementation for SqlBulkCopy (or the MySql or Oracle equivalents) that wraps an IEnumerable, and libraries for mapping entites to table columns.
Stars: ✭ 210 (+1300%)
Mutual labels:  pipeline, etl
Airbyte
Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
Stars: ✭ 4,919 (+32693.33%)
Mutual labels:  pipeline, etl
Metl
mito ETL tool
Stars: ✭ 153 (+920%)
Mutual labels:  pipeline, etl
sparklanes
A lightweight data processing framework for Apache Spark
Stars: ✭ 17 (+13.33%)
Mutual labels:  pipeline, etl
mydataharbor
🇨🇳 MyDataHarbor is a distributed, highly extensible, high-performance, transaction-level data synchronization middleware for moving data between arbitrary sources and sinks. It helps users perform reliable, fast, and stable near-real-time incremental synchronization or scheduled full synchronization of massive data. It primarily targets real-time transactional systems, but can also be used for big-data synchronization (the ETL domain).
Stars: ✭ 28 (+86.67%)
Mutual labels:  pipeline, etl
lineage
Generate beautiful documentation for your data pipelines in markdown format
Stars: ✭ 16 (+6.67%)
Mutual labels:  pipeline, etl
bash-streams-handbook
💻 Learn Bash streams, pipelines and redirection, from beginner to advanced.
Stars: ✭ 153 (+920%)
Mutual labels:  pipeline
pipelines-as-code
Pipelines as Code
Stars: ✭ 37 (+146.67%)
Mutual labels:  pipeline
es2postgres
ElasticSearch to PostgreSQL loader
Stars: ✭ 18 (+20%)
Mutual labels:  etl
oesophagus
Enterprise Grade Single-Step Streaming Data Infrastructure Setup. (Under Development)
Stars: ✭ 12 (-20%)
Mutual labels:  etl
prose
A python framework to process FITS images. Built for Astronomy.
Stars: ✭ 21 (+40%)
Mutual labels:  pipeline
TEAM
The Taxonomy for ETL Automation Metadata (TEAM) is a metadata management tool for data warehouse automation. It is part of the ecosystem for data warehouse automation, alongside the Virtual Data Warehouse pattern manager and the generic schema for Data Warehouse Automation.
Stars: ✭ 27 (+80%)
Mutual labels:  etl

etl

branch       | travis-ci           | report-card    | coveralls
master       | Travis Build Status |                | Coverage Status
integration  | Travis Build Status | Go Report Card | Coverage Status

ETL (extract, transform, load) is a core component of the M-Lab data processing pipeline. The ETL worker is responsible for parsing data archives produced by pusher and publishing M-Lab measurements to BigQuery.

Local Development

go get ./cmd/etl_worker
gcloud auth application-default login
~/bin/etl_worker -service_port :8080 -output_dir ./output -output local

From the command line (or a browser), make a request to the /v2/worker resource with a filename= parameter naming a valid M-Lab GCS archive.

URL=gs://archive-measurement-lab/ndt/ndt7/2021/06/14/20210614T003000.696927Z-ndt7-mlab1-yul04-ndt.tgz
curl "http://localhost:8080/v2/worker?filename=$URL"

Generating Schema Docs

To build a new docker image with the generate_schema_docs command, run:

$ docker build -t measurementlab/generate-schema-docs .
$ docker run -v $PWD:/workspace -w /workspace \
  -it measurementlab/generate-schema-docs

Writing schema_ndtresultrow.md
...

Moving to GKE

The universal parser will run in GKE, using parser-pool node pools, defined like this:

gcloud --project=mlab-sandbox container node-pools create parser-pool-1 \
  --cluster=data-processing   --num-nodes=3   --region=us-east1 \
  --scopes storage-ro,compute-rw,bigquery,datastore \
  --node-labels=parser-node=true   --enable-autorepair --enable-autoupgrade \
  --machine-type=n1-standard-16

The images come from gcr.io and are built by Google Cloud Build. The build trigger can currently be listed with:

gcloud beta builds triggers list --filter=m-lab/etl

Deployment requires granting the cloud-kubernetes-deployer role to etl-travis-deploy@ in IAM. This has been done for sandbox and staging.

Migrating to the Sink interface

The parsers currently use etl.Inserter as the backend for writing records. This API is overly shaped by BigQuery, which complicates testing and extension.

The row.Sink interface and row.Buffer type define cleaner APIs for the backend and for buffering and annotating. This will streamline the migration to Gardener-driven table selection and column-partitioned tables, and possibly a future migration to BigQuery load jobs instead of streaming inserts.

Factories

The TaskFactory aggregates a number of other factories for the elements required for a Task. Factory injection is used to generalize ProcessGKETask and simplify testing.

  • SinkFactory produces a Sink for output.
  • SourceFactory produces a Source for the input data.
  • AnnotatorFactory produces an Annotator to be used to annotate rows.