deeplearning4j / Datavec

License: Apache-2.0
ETL Library for Machine Learning - data pipelines, data munging and wrangling

Programming Languages

Java

Projects that are alternatives of or similar to Datavec

Setl
A simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (-70.96%)
Mutual labels:  spark, pipeline, etl
Stetl
Stetl, Streaming ETL, is a lightweight geospatial processing and ETL framework written in Python.
Stars: ✭ 64 (-76.47%)
Mutual labels:  pipeline, etl, transformations
basin
Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser
Stars: ✭ 25 (-90.81%)
Mutual labels:  spark, pipeline, etl
Transmogrifai
TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
Stars: ✭ 2,084 (+666.18%)
Mutual labels:  spark, transformations
Dataspherestudio
DataSphereStudio is a one stop data application development& management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.
Stars: ✭ 1,195 (+339.34%)
Mutual labels:  spark, etl
Udacity Data Engineering
Udacity Data Engineering Nano Degree (DEND)
Stars: ✭ 89 (-67.28%)
Mutual labels:  spark, etl
Metorikku
A simplified, lightweight ETL Framework based on Apache Spark
Stars: ✭ 361 (+32.72%)
Mutual labels:  spark, etl
Omniparser
omniparser: a native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, text, etc.
Stars: ✭ 148 (-45.59%)
Mutual labels:  schema, etl
Spark Bigquery
Google BigQuery support for Spark, Structured Streaming, SQL, and DataFrames with easy Databricks integration.
Stars: ✭ 65 (-76.1%)
Mutual labels:  schema, spark
Graphql Parser
A graphql query language and schema definition language parser and formatter for rust
Stars: ✭ 203 (-25.37%)
Mutual labels:  schema, formatter
sparklanes
A lightweight data processing framework for Apache Spark
Stars: ✭ 17 (-93.75%)
Mutual labels:  pipeline, etl
Luigi Warehouse
A luigi powered analytics / warehouse stack
Stars: ✭ 72 (-73.53%)
Mutual labels:  spark, etl
Pyspark Example Project
Example project implementing best practices for PySpark ETL jobs and applications.
Stars: ✭ 633 (+132.72%)
Mutual labels:  spark, etl
Wedatasphere
WeDataSphere is a financial level one-stop open-source suitcase for big data platforms. Currently the source code of Scriptis and Linkis has already been released to the open-source community. WeDataSphere, Big Data Made Easy!
Stars: ✭ 372 (+36.76%)
Mutual labels:  spark, etl
mydataharbor
🇨🇳 MyDataHarbor is a distributed, highly extensible, high-performance, transaction-level data synchronization middleware for moving data from any source to any destination. It helps users perform near-real-time incremental synchronization or scheduled full synchronization of massive data reliably, quickly and stably. It is primarily aimed at real-time transaction systems, and can also be used for big-data synchronization (the ETL domain).
Stars: ✭ 28 (-89.71%)
Mutual labels:  pipeline, etl
lineage
Generate beautiful documentation for your data pipelines in markdown format
Stars: ✭ 16 (-94.12%)
Mutual labels:  pipeline, etl
Osom
An Awesome [/osom/] Object Data Modeling (Database Agnostic).
Stars: ✭ 68 (-75%)
Mutual labels:  schema, transformations
Airbyte
Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
Stars: ✭ 4,919 (+1708.46%)
Mutual labels:  pipeline, etl
Bulk Writer
Provides guidance for fast ETL jobs, an IDataReader implementation for SqlBulkCopy (or the MySql or Oracle equivalents) that wraps an IEnumerable, and libraries for mapping entities to table columns.
Stars: ✭ 210 (-22.79%)
Mutual labels:  pipeline, etl
naas
⚙️ Schedule notebooks, run them like APIs, expose securely your assets: Jupyter as a viable ⚡️ Production environment
Stars: ✭ 219 (-19.49%)
Mutual labels:  pipeline, etl

DataVec

DataVec is an Apache 2.0-licensed library for machine-learning ETL (Extract, Transform, Load) operations. DataVec's purpose is to transform raw data into usable vector formats that can be fed to machine learning algorithms. By contributing code to this repository, you agree to make your contribution available under an Apache 2.0 license.

Why Would I Use DataVec?

Data handling is sometimes messy, and we believe it should be distinct from high-performance algebra libraries (such as nd4j or Deeplearning4j).

DataVec allows a practitioner to take raw data and quickly produce vectorized data in open, standard-compliant formats (SVMLight, etc.). The following input data types are supported out of the box (a short image-reading sketch follows the list):

  • CSV Data
  • Raw Text Data (Tweets, Text Documents, etc)
  • Image Data
  • LibSVM
  • SVMLight
  • MatLab (MAT) format
  • JSON, XML, YAML
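
For instance, image data can be ingested with the ImageRecordReader from the image module (datavec-data-image). The sketch below is indicative only: the directory path is hypothetical, it assumes a layout where each subdirectory name is the class label, and constructor argument types have varied slightly between releases.

```java
import org.datavec.api.io.labels.ParentPathLabelGenerator;
import org.datavec.api.split.FileSplit;
import org.datavec.image.recordreader.ImageRecordReader;

import java.io.File;

public class ImageReadSketch {
    public static void main(String[] args) throws Exception {
        // Use the parent directory name of each file as its label.
        ParentPathLabelGenerator labelMaker = new ParentPathLabelGenerator();

        // Resize every image to 64x64 pixels with 3 channels (RGB).
        ImageRecordReader reader = new ImageRecordReader(64, 64, 3, labelMaker);
        reader.initialize(new FileSplit(new File("data/images")));  // hypothetical directory

        while (reader.hasNext()) {
            // Each record is the image content plus its label, represented as Writables.
            System.out.println(reader.next());
        }
        reader.close();
    }
}
```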

DataVec draws inspiration from many of the Hadoop ecosystem tools and, in particular, accesses data on disk through the Hadoop API (as Spark does), which means it is compatible with many common record formats.

DataVec also includes sophisticated functionality for feature engineering, data cleaning and data normalization both for static data and for sequences (time series). Such operations can be executed on Apache Spark using DataVec-Spark.

DataVec's architecture: API, transforms and filters, and schema management

Apart from providing readers for classic data formats, DataVec also provides an interface for custom ingestion. If you want to ingest your own custom data, you don't have to build the whole pipeline; you only have to implement the very first step: you describe through the API how your data fits into a common format that complies with the interface, and DataVec returns a list of Writables for each record. You'll find more detail on the API in the corresponding module.
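
As an illustration of the record model, here is a minimal sketch of reading a CSV file with the built-in CSVRecordReader; each call to next() yields one record as a list of Writables. The file path is hypothetical, and exact constructor options vary slightly across DataVec versions.

```java
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.writable.Writable;

import java.io.File;
import java.util.List;

public class CsvReadSketch {
    public static void main(String[] args) throws Exception {
        // Default constructor: skip no header lines, use ',' as the delimiter.
        RecordReader reader = new CSVRecordReader();
        // Hypothetical local file; any InputSplit (local files, HDFS, ...) can be used here.
        reader.initialize(new FileSplit(new File("data/iris.csv")));

        while (reader.hasNext()) {
            List<Writable> record = reader.next();  // one record = one list of Writables
            System.out.println(record);
        }
        reader.close();
    }
}
```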

DataVec also provides data cleaning functionality. Instead of starting from clean, ready-to-go data, you may have data in different forms or from different sources, and need sampling, filtering, or any of the other messy ETL tasks that preparing real-world data requires. DataVec offers filters and transformations that help with curating, preparing and massaging your data, and it leverages Apache Spark to do this at scale.
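
As a rough sketch of what such a pipeline looks like, a TransformProcess is built against a Schema and chains transformations and filters. The column names below are invented for illustration; check the transform module for the current builder API.

```java
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.condition.ConditionOp;
import org.datavec.api.transform.condition.column.IntegerColumnCondition;
import org.datavec.api.transform.filter.ConditionFilter;
import org.datavec.api.transform.schema.Schema;

public class TransformSketch {
    public static void main(String[] args) {
        // Illustrative schema: the column names are made up for this sketch.
        Schema inputSchema = new Schema.Builder()
                .addColumnString("userId")
                .addColumnInteger("age")
                .addColumnDouble("purchaseAmount")
                .build();

        // Drop a column we don't need, and remove records with an invalid (negative) age.
        TransformProcess tp = new TransformProcess.Builder(inputSchema)
                .removeColumns("userId")
                .filter(new ConditionFilter(
                        new IntegerColumnCondition("age", ConditionOp.LessThan, 0)))
                .build();
    }
}
```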

Finally, DataVec tracks a schema for your columnar data across all transformations. This schema is actively checked through probing, and DataVec will raise exceptions if your data does not match the schema. You can specify filters as well: you can attach a regular expression to an input column of type String, for example, and DataVec will only keep the data that matches this filter.
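
For instance, here is a hedged sketch of keeping only the rows whose String column matches a regular expression, and of inspecting the schema that results from the pipeline. The column names and regex are invented, and the exact condition classes may differ between versions.

```java
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.condition.BooleanCondition;
import org.datavec.api.transform.condition.column.StringRegexColumnCondition;
import org.datavec.api.transform.filter.ConditionFilter;
import org.datavec.api.transform.schema.Schema;

public class SchemaFilterSketch {
    public static void main(String[] args) {
        // Hypothetical two-column schema.
        Schema inputSchema = new Schema.Builder()
                .addColumnString("email")
                .addColumnDouble("score")
                .build();

        // ConditionFilter removes rows where the condition holds, so we negate
        // the regex match to keep only well-formed email addresses.
        TransformProcess tp = new TransformProcess.Builder(inputSchema)
                .filter(new ConditionFilter(BooleanCondition.NOT(
                        new StringRegexColumnCondition("email", ".+@.+\\..+"))))
                .build();

        // The schema is carried through every step of the pipeline.
        System.out.println(tp.getFinalSchema());
    }
}
```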

On Distribution

Distributed processing through Apache Spark is entirely optional, including running Spark in local mode (where your cluster is emulated with multi-threading) when necessary. DataVec aims to abstract away from the actual execution and to create, at compile time, a logical set of operations to execute. While we have some code that uses Spark, we do not want to be locked into a single tool; Apache Flink or Beam are possibilities, and we would welcome collaboration on them.
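
A rough sketch of running a TransformProcess with the datavec-spark module in Spark local mode might look like the following. The file path and column names are hypothetical, and the executor's API has moved around between releases, so treat this as indicative rather than definitive.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.schema.Schema;
import org.datavec.api.writable.Writable;
import org.datavec.spark.transform.SparkTransformExecutor;
import org.datavec.spark.transform.misc.StringToWritablesFunction;

import java.util.List;

public class SparkLocalSketch {
    public static void main(String[] args) {
        // "local[*]" emulates a cluster with one worker thread per core on the local machine.
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("datavec-local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical input: a headerless CSV with two columns.
        Schema schema = new Schema.Builder()
                .addColumnString("label")
                .addColumnDouble("value")
                .build();
        TransformProcess tp = new TransformProcess.Builder(schema)
                .removeColumns("label")
                .build();

        JavaRDD<String> lines = sc.textFile("data/input.csv");
        JavaRDD<List<Writable>> parsed =
                lines.map(new StringToWritablesFunction(new CSVRecordReader()));

        // Execute the same logical pipeline on Spark; only the executor changes, not the TransformProcess.
        JavaRDD<List<Writable>> processed = SparkTransformExecutor.execute(parsed, tp);
        processed.take(5).forEach(System.out::println);

        sc.stop();
    }
}
```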

Examples

Examples for using DataVec are available here: https://github.com/deeplearning4j/dl4j-examples


Contribute

Where to contribute?

We have a lot in the pipeline, and even more that we would love to receive as contributions. We want to support representing data as more than a collection of simple types ("writables") and instead as binary data; that will help with GC pressure across our pipelines and fit better with media-based use cases, where columnar data is not essential. We also expect it will streamline many of the specialized operations we currently perform on primitive types.

With that being said, an area that could welcome a first contribution is implementations of the RecordReader interface, since they are relatively self-contained. Of note, to support most of the distributed file formats of the Hadoop ecosystem, we use Apache Camel. Camel supports a pluggable DataFormat that allows messages to be marshalled to and from binary or text formats, supporting a kind of Message Translator.

Another area that is relatively self-contained is transformations, where you might find a filter or data munging operation that has not been implemented yet, and provide it in a self-contained way.

Which maintainers to contact?

It's often useful to know which maintainers to contact for information on a particular part of the code, for reviews of your pull requests, or for questions on our Gitter channel. For this you can use the following, indicative mapping:

How to contribute

  1. Check for open issues, or open a new issue to start a discussion around a feature idea or a bug.

  2. If you feel uncomfortable or uncertain about an issue or your changes, feel free to contact us on Gitter using the link above.

  3. Fork the repository on GitHub to start making your changes.

  4. Write a test, which shows that the bug was fixed or that the feature works as expected.

  5. Note the repository follows the Google Java style with two modifications: 120-character column wrap and 4-space indentation. You can format your code accordingly by running mvn formatter:format in the subproject you are working on, by using contrib/formatter.xml at the root of the repository to configure the Eclipse formatter, or by using the IntelliJ plugin.

  6. Send a pull request, and bug us on Gitter until it gets merged and published.

Eclipse Setup

  1. Download the latest Lombok jar from https://projectlombok.org/download
  2. Double-click the jar to install the Lombok plugin for Eclipse
  3. Clone DataVec to your system
  4. Import the project as a Maven project
  5. You will also need to clone and build ND4J and libnd4j