CoxAutomotiveDataSolutions / Waimak

Licence: apache-2.0
Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.

Programming Languages

scala
5932 projects

Projects that are alternatives to or similar to Waimak

Goodreads etl pipeline
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Stars: ✭ 793 (+1221.67%)
Mutual labels:  spark, data-engineering
Docker Hadoop
A Docker container with a full Hadoop cluster setup with Spark and Zeppelin
Stars: ✭ 54 (-10%)
Mutual labels:  spark, hadoop
Bigdataguide
Big data study guide: learn big data from scratch, with videos and interview materials for every stage of learning
Stars: ✭ 817 (+1261.67%)
Mutual labels:  spark, hadoop
H2o 3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Stars: ✭ 5,656 (+9326.67%)
Mutual labels:  spark, hadoop
Docker Spark Cluster
A Spark cluster setup running on Docker containers
Stars: ✭ 57 (-5%)
Mutual labels:  spark, hadoop
Pyspark Example Project
Example project implementing best practices for PySpark ETL jobs and applications.
Stars: ✭ 633 (+955%)
Mutual labels:  spark, data-engineering
Kylo
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
Stars: ✭ 916 (+1426.67%)
Mutual labels:  spark, hadoop
Data Science Ipython Notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Stars: ✭ 22,048 (+36646.67%)
Mutual labels:  spark, hadoop
Interview Questions Collection
Interview questions organized by subject area, covering C++, Java, Hadoop, machine learning, and more
Stars: ✭ 21 (-65%)
Mutual labels:  spark, hadoop
Bigdata Interview
🎯 🌟 [Big data interview questions] A collection of big-data interview questions gathered from around the web, with the author's own answer summaries. Currently covers the Hadoop/Hive/Spark/Flink/HBase/Kafka/Zookeeper frameworks
Stars: ✭ 857 (+1328.33%)
Mutual labels:  spark, hadoop
Alluxio
Alluxio, data orchestration for analytics and machine learning in the cloud
Stars: ✭ 5,379 (+8865%)
Mutual labels:  spark, hadoop
Learning Spark
Learn Spark from scratch; big data study materials
Stars: ✭ 37 (-38.33%)
Mutual labels:  spark, hadoop
Pointblank
Data validation and organization of metadata for data frames and database tables
Stars: ✭ 480 (+700%)
Mutual labels:  spark, data-engineering
Useractionanalyzeplatform
A big data platform for analyzing e-commerce user behavior
Stars: ✭ 645 (+975%)
Mutual labels:  spark, hadoop
Pdf
Programming e-books covering C, C#, Docker, Elasticsearch, Git, Hadoop, HeadFirst, Java, JavaScript, JVM, Kafka, Linux, Maven, MongoDB, MyBatis, MySQL, Netty, Nginx, Python, RabbitMQ, Redis, Scala, Solr, Spark, Spring, SpringBoot, SpringCloud, TCP/IP, Tomcat, Zookeeper, artificial intelligence, big data, concurrent programming, databases, data mining, new interview questions, architecture design, algorithms, computer science, design patterns, software testing, refactoring and optimization, and more
Stars: ✭ 12,009 (+19915%)
Mutual labels:  spark, hadoop
Szt Bigdata
A big data passenger-flow analysis system for the Shenzhen Metro 🚇🚄🌟
Stars: ✭ 826 (+1276.67%)
Mutual labels:  spark, hadoop
Marmaray
Generic Data Ingestion & Dispersal Library for Hadoop
Stars: ✭ 414 (+590%)
Mutual labels:  spark, hadoop
God Of Bigdata
Focused on big data study and interview preparation; the path to big data mastery starts here. Flink/Spark/Hadoop/HBase/Hive...
Stars: ✭ 6,008 (+9913.33%)
Mutual labels:  spark, hadoop
Dockerfiles
50+ DockerHub public images for Docker & Kubernetes - Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak, TeamCity and DevOps tools built on the major Linux distros: Alpine, CentOS, Debian, Fedora, Ubuntu
Stars: ✭ 847 (+1311.67%)
Mutual labels:  spark, hadoop
Data Algorithms Book
MapReduce, Spark, Java, and Scala for Data Algorithms Book
Stars: ✭ 949 (+1481.67%)
Mutual labels:  spark, hadoop

Waimak

Join the chat at https://gitter.im/waimak-framework/users

What is Waimak?

Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.

Waimak aims to abstract the more complex parts of Spark application development (such as orchestration) away from the business logic, allowing users to get their business logic in a production-ready state much faster. By using a framework written by Data Engineers, the teams defining the business logic can write and own their production code.

Our metaphor to describe this framework is the braided river – it splits and rejoins itself repeatedly on its journey. By describing a Spark application as a sequence of flow transformations, Waimak can execute independent branches of the flow in parallel, making more efficient use of compute resources and greatly reducing the execution time of complex flows.
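
As a rough sketch of the braided-river idea (the transform action shape, the labels, and the column names here are illustrative assumptions, not taken verbatim from the Waimak docs; spark, basePath and baseDest are as in the example further down this README), a flow whose input splits into two independent branches might look like:

import com.coxautodata.waimak.dataflow.Waimak

// Illustrative sketch: the "events" input is opened once and then feeds
// two independent branches; because neither branch depends on the other,
// Waimak is free to execute them in parallel before writing both out.
val braidedFlow = Waimak.sparkFlow(spark)
  .openCSV(basePath)("events")
  .transform("events")("by_user")(_.groupBy("user_id").count())   // branch 1
  .transform("events")("by_day")(_.groupBy("event_date").count()) // branch 2
  .writeParquet(baseDest)("by_user", "by_day")

braidedFlow.execute()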

Why would I use Waimak?

We developed Waimak to:

  • allow teams to own their own business logic without owning an entire production Spark application
  • reduce the time it takes to write production-ready Spark applications
  • provide an intuitive structure to Spark applications by describing them as a sequence of transformations forming a flow
  • increase the performance of Spark data flows by making more efficient use of the Spark executors

Importantly, Waimak is a framework for building Spark applications by describing a sequence of composed Spark transformations. To create those transformations Waimak exposes the complete Spark API, giving you the power of Apache Spark with added structure.
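
For instance (a hedged sketch: the two-input transform form and the column names are assumptions for illustration), the body of a flow action is ordinary Spark code, so the full DataFrame/Dataset API is available inside it:

// Inside a transform the body is plain Spark code: here a join followed
// by an aggregation, written exactly as it would be in a bare Spark app.
val enrichedFlow = Waimak.sparkFlow(spark)
  .openCSV(basePath)("items", "person")
  .transform("items", "person")("items_per_person") { (items, person) =>
    items.join(person, Seq("person_id"))
      .groupBy("person_id")
      .count()
  }
  .writeParquet(baseDest)("items_per_person")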

How do I get started?

You can import Waimak into your Maven project using the following dependency details:

        <dependency>
            <groupId>com.coxautodata</groupId>
            <artifactId>waimak-core_2.11</artifactId>
            <version>${waimak.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>

Waimak marks the Spark dependency as optional so as not to depend on any specific release of Spark; you must therefore specify the version of Spark you wish to use as a dependency. Waimak should run on any version of Spark 2.2+, but the list of officially tested versions is given below.
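
If you build with sbt rather than Maven, an equivalent declaration (a sketch mirroring the Maven snippet above; the version values are placeholders you would set yourself) would be:

// build.sbt sketch: Waimak as a normal dependency, plus your chosen
// Spark version marked Provided, as in the Maven snippet above.
val waimakVersion = "x.y.z"  // placeholder: a released Waimak version
val sparkVersion  = "2.4.3"  // placeholder: your chosen Spark version

libraryDependencies ++= Seq(
  "com.coxautodata" %% "waimak-core" % waimakVersion,
  "org.apache.spark" %% "spark-core" % sparkVersion % Provided
)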

The following code snippet demonstrates a basic Waimak example taken from the unit tests:

// Required imports
import com.coxautodata.waimak.dataflow.Waimak

// `spark` is an existing SparkSession, and `basePath` and `baseDest` are
// input and output directory paths; all three are defined in the unit
// tests this example is taken from.

// Initialise basic Waimak objects
val emptyFlow = Waimak.sparkFlow(spark)

// Add actions to the flow: open two CSV inputs, give them friendlier
// labels, and write both out as Parquet
val basicFlow = emptyFlow
    .openCSV(basePath)("csv_1", "csv_2")
    .alias("csv_1", "items")
    .alias("csv_2", "person")
    .writeParquet(baseDest)("items", "person")

// Run the flow
basicFlow.execute()

This example is very small, but in practice flow definitions can become very large depending on the number of inputs and outputs in a job.

The project wiki page provides best practices for structuring your project when dealing with large flows.
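
One pattern that helps (a sketch of the general idea, not a verbatim recommendation from the wiki; the SparkDataFlow import path is an assumption) is to split a large flow into functions that each add one logical stage:

import com.coxautodata.waimak.dataflow.Waimak
import com.coxautodata.waimak.dataflow.spark.SparkDataFlow

// Hypothetical helpers: each function adds one logical group of actions,
// so the overall flow reads as a pipeline of named stages.
def addIngest(flow: SparkDataFlow): SparkDataFlow =
  flow.openCSV(basePath)("csv_1", "csv_2")

def addPublish(flow: SparkDataFlow): SparkDataFlow =
  flow.writeParquet(baseDest)("csv_1", "csv_2")

val fullFlow = addPublish(addIngest(Waimak.sparkFlow(spark)))
fullFlow.execute()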

What Waimak modules are available?

Waimak currently consists of the following modules:

  • waimak-core: Core Waimak functionality and generic actions
  • waimak-configuration-databricks: Databricks-specific configuration provider using secret scopes (Scala 2.11 only)
  • waimak-impala: Impala implementation of the HadoopDBConnector used for committing labels to an Impala DB
  • waimak-hive: Hive implementation of the HadoopDBConnector used for committing labels to a Hive Metastore
  • waimak-rdbm-ingestion: Functionality to ingest inputs from a range of RDBM sources
  • waimak-storage: Functionality for providing a hot/cold region-based ingestion storage layer
  • waimak-app: Functionality providing Waimak application templates and orchestration
  • waimak-experimental: Experimental features currently under development
  • waimak-dataquality: Functionality for monitoring and alerting on data quality
  • waimak-deequ: Amazon Deequ implementation of data quality monitoring (Scala 2.11 only)

All modules are released to Maven Central.

What versions of Spark are supported?

Waimak is tested against the following versions of Spark:

Package Maintainer   Spark Version   Scala Version
Apache Spark         2.2.0           2.11
Apache Spark         2.3.0           2.11
Apache Spark         2.4.0           2.11
Apache Spark         2.4.3           2.12
Cloudera Spark       2.2.0           2.11

Other versions of Spark >= 2.2 are also likely to work and can be added to the list of tested versions if there is sufficient need.

Where can I learn more?

You can find the latest documentation for Waimak on the project wiki page. This README file contains basic setup instructions and general project information.

You can also find details of what's in the latest releases in the changelog.

Finally, you can talk to the developers and other users directly in our Gitter room.

Can I contribute to Waimak?

We welcome all users to contribute to the development of Waimak by raising pull requests. We kindly ask that you include suitable unit tests along with proposed changes.

How do I test my contributions?

Waimak is tested against different versions of Spark 2.x to ensure uniform compatibility. The versions of Spark tested by Waimak are given in the <profiles> section of the POM. You can activate a given profile by using the -P flag:

        mvn clean package -P apache-2.3.0_2.11

The integration tests of the RDBM ingestion module require Docker; you must therefore have the Docker service running, and the current user must be able to access it.

What is Waimak licensed under?

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Copyright 2018 Cox Automotive UK Limited

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].