
intel-analytics / Bigdl

Licence: apache-2.0
Building Large-Scale AI Applications for Distributed Big Data

Programming Languages

Python
139335 projects - #7 most used programming language
Scala
5932 projects
Jupyter Notebook
11667 projects
Java
68154 projects - #9 most used programming language
Shell
77523 projects
Dockerfile
14818 projects

Projects that are alternatives to or similar to Bigdl

H2o 3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Stars: ✭ 5,656 (+48.33%)
Mutual labels:  spark, big-data, hadoop
Data Science Ipython Notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Stars: ✭ 22,048 (+478.23%)
Mutual labels:  spark, big-data, hadoop
Mmlspark
Simple and Distributed Machine Learning
Stars: ✭ 2,899 (-23.97%)
Mutual labels:  ai, spark, big-data
Sparkrdma
RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark
Stars: ✭ 215 (-94.36%)
Mutual labels:  spark, big-data, hadoop
Gaffer
A large-scale entity and relation database supporting aggregation of properties
Stars: ✭ 1,642 (-56.94%)
Mutual labels:  spark, big-data, hadoop
Docker Spark Cluster
A Spark cluster setup running on Docker containers
Stars: ✭ 57 (-98.51%)
Mutual labels:  spark, big-data, hadoop
Analytics Zoo
Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
Stars: ✭ 2,448 (-35.8%)
Mutual labels:  bigdl, distributed-deep-learning, analytics-zoo
Bigdata Notes
A getting-started guide to big data (大数据入门指南) ⭐
Stars: ✭ 10,991 (+188.25%)
Mutual labels:  spark, big-data, hadoop
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-96.07%)
Mutual labels:  spark, big-data, hadoop
Xlearning Xdml
extremely distributed machine learning
Stars: ✭ 113 (-97.04%)
Mutual labels:  ai, spark, hadoop
Trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Stars: ✭ 4,581 (+20.14%)
Mutual labels:  big-data, hadoop
Succinct
Enabling queries on compressed data.
Stars: ✭ 257 (-93.26%)
Mutual labels:  spark, big-data
Oie Resources
A curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.
Stars: ✭ 283 (-92.58%)
Mutual labels:  ai, big-data
Sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
Stars: ✭ 362 (-90.51%)
Mutual labels:  spark, big-data
Hive
Apache Hive
Stars: ✭ 4,031 (+5.72%)
Mutual labels:  big-data, hadoop
Transmogrifai
TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
Stars: ✭ 2,084 (-45.34%)
Mutual labels:  ai, spark
Cloudbreak
A tool for provisioning and managing Apache Hadoop clusters in the cloud. Cloudbreak, as part of the Hortonworks Data Platform, makes it easy to provision, configure and elastically grow HDP clusters on cloud infrastructure. Cloudbreak can be used to provision Hadoop across cloud infrastructure providers including AWS, Azure, GCP and OpenStack.
Stars: ✭ 301 (-92.11%)
Mutual labels:  big-data, hadoop
Spline
Data Lineage Tracking And Visualization Solution
Stars: ✭ 306 (-91.97%)
Mutual labels:  spark, hadoop
Datasciencevm
Tools and Docs on the Azure Data Science Virtual Machine (http://aka.ms/dsvm)
Stars: ✭ 153 (-95.99%)
Mutual labels:  ai, big-data
Elasticluster
Create clusters of VMs on the cloud and configure them with Ansible.
Stars: ✭ 298 (-92.18%)
Mutual labels:  spark, hadoop


Building Large-Scale AI Applications for Distributed Big Data


BigDL makes it easy for data scientists and data engineers to build end-to-end, distributed AI applications. The BigDL 2.0 release combines the original BigDL and Analytics Zoo projects, providing the following features:

  • DLlib: distributed deep learning library for Apache Spark (i.e., the original BigDL framework with Keras-style API and Spark ML pipeline support)

  • Orca: seamlessly scale out TensorFlow and PyTorch pipelines for distributed Big Data

  • RayOnSpark: run Ray programs directly on Big Data clusters

  • Chronos: scalable time series analysis using AutoML

  • PPML: privacy preserving big data analysis and machine learning (experimental)

  • Cluster Serving: distributed, real-time model serving

For more information, you may read the docs.


Installing

You can use BigDL on Google Colab without any installation. BigDL also includes a set of notebooks that you can directly open and run in Colab.

To install BigDL, we recommend using a conda environment:

conda create -n my_env 
conda activate my_env
pip install bigdl 

To install the latest nightly build, use pip install --pre --upgrade bigdl; see the Python and Scala user guides for more details.
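
For example, a minimal sketch of setting up a separate environment with the nightly build (the environment name my_nightly_env below is just a placeholder):

conda create -n my_nightly_env
conda activate my_nightly_env
pip install --pre --upgrade bigdl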

Getting Started with DLlib

DLlib is a distributed deep learning library for Apache Spark; with DLlib, users can write distributed deep learning applications as standard Spark programs (using either Scala or Python APIs).

First, call initNNContext at the beginning of the code:

import com.intel.analytics.bigdl.dllib.NNContext
val sc = NNContext.initNNContext()

Then, define the BigDL model using Keras-style API:

// Input, Dense, Activation, Model and Shape come from the DLlib Keras-style API
val input = Input[Float](inputShape = Shape(10))
val dense = Dense[Float](12).inputs(input)
val output = Activation[Float]("softmax").inputs(dense)
val model = Model(input, output)

After that, use NNEstimator to train/predict/evaluate the model using Spark DataFrames and ML pipelines:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.MinMaxScaler

val trainingDF = spark.read.parquet("train_data")
val validationDF = spark.read.parquet("val_data")
val scaler = new MinMaxScaler().setInputCol("in").setOutputCol("value")
// NNEstimator, CrossEntropyCriterion and Adam come from the BigDL DLlib library;
// size and epoch are placeholders for the batch size and number of epochs
val estimator = NNEstimator(model, CrossEntropyCriterion())
        .setBatchSize(size).setOptimMethod(new Adam()).setMaxEpoch(epoch)
val pipeline = new Pipeline().setStages(Array(scaler, estimator))

val pipelineModel = pipeline.fit(trainingDF)
val predictions = pipelineModel.transform(validationDF)

See the NNframes and Keras API user guides for more details.
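
DLlib can also be used from Python; the snippet below is a minimal sketch of the same Keras-style model in Python, assuming the BigDL 2.0 module layout (the import paths and functional-style layer calls are assumptions, so check the Keras API user guide for the exact names):

# NOTE: module paths below are assumptions based on the BigDL 2.0 package layout
from bigdl.dllib.nncontext import init_nncontext
from bigdl.dllib.keras.layers import Input, Dense, Activation
from bigdl.dllib.keras.models import Model

sc = init_nncontext()

# same model as the Scala example above: 10-dim input -> Dense(12) -> softmax
inp = Input(shape=(10,))
dense = Dense(12)(inp)
out = Activation("softmax")(dense)
model = Model(inp, out)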

Getting Started with Orca

Most AI projects start with a Python notebook running on a single laptop; however, scaling them up to handle larger data sets in a distributed fashion usually involves a great deal of pain. The Orca library seamlessly scales out your single-node TensorFlow or PyTorch notebook across large clusters (so as to process distributed Big Data).

First, initialize Orca Context:

from bigdl.orca import init_orca_context, OrcaContext

# cluster_mode can be "local", "k8s" or "yarn"
sc = init_orca_context(cluster_mode="yarn", cores=4, memory="10g", num_nodes=2) 

Next, perform data-parallel processing in Orca (supporting standard Spark DataFrames, TensorFlow Dataset, PyTorch DataLoader, Pandas, Pillow, etc.):

from pyspark.sql.functions import array

spark = OrcaContext.get_spark_session()
df = spark.read.parquet(file_path)
df = df.withColumn('user', array('user')) \
       .withColumn('item', array('item'))

Finally, use sklearn-style Estimator APIs in Orca to perform distributed TensorFlow, PyTorch or Keras training and inference:

from tensorflow import keras
from bigdl.orca.learn.tf.estimator import Estimator

# a simple Keras model that takes the 'user' and 'item' columns prepared above as inputs
user = keras.layers.Input(shape=[1])
item = keras.layers.Input(shape=[1])
feat = keras.layers.concatenate([user, item], axis=1)
predictions = keras.layers.Dense(2, activation='softmax')(feat)
model = keras.models.Model(inputs=[user, item], outputs=predictions)
model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

est = Estimator.from_keras(keras_model=model)
est.fit(data=df,
        batch_size=64,
        epochs=4,
        feature_cols=['user', 'item'],
        label_cols=['label'])
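
After training, the same Estimator can be used for distributed inference and evaluation on Spark DataFrames. The snippet below is a hedged sketch: predict and evaluate follow the same column-based convention as the fit call above, but verify the exact signatures in the Orca Estimator API docs.

# distributed inference; returns predictions for each row of the DataFrame
prediction_df = est.predict(df, feature_cols=['user', 'item'])

# distributed evaluation against the label column
metrics = est.evaluate(df,
                       batch_size=64,
                       feature_cols=['user', 'item'],
                       label_cols=['label'])
print(metrics)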

See the TensorFlow and PyTorch quickstarts, as well as the documentation website, for more details.

Getting Started with RayOnSpark

Ray is an open source distributed framework for emerging AI applications. RayOnSpark allows users to directly run Ray programs on existing Big Data clusters, and directly write Ray code inline with their Spark code (so as to process the in-memory Spark RDDs or DataFrames).

from bigdl.orca import init_orca_context

# cluster_mode can be "local", "k8s" or "yarn"
sc = init_orca_context(cluster_mode="yarn", cores=4, memory="10g", num_nodes=2, init_ray_on_spark=True) 

import ray

@ray.remote
class Counter(object):
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

counters = [Counter.remote() for i in range(5)]
print(ray.get([c.increment.remote() for c in counters]))
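
When the Ray program is done, the Ray cluster (together with the underlying Orca context) can be shut down; a minimal sketch:

from bigdl.orca import stop_orca_context

# stop Ray on Spark and release the cluster resources held by the Orca context
stop_orca_context()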

See the RayOnSpark user guide and quickstart for more details.

Getting Started with Chronos

Time series prediction takes observations from previous time steps as input and predicts the values at future time steps. The Chronos library makes it easy to build end-to-end time series analysis applications by applying AutoML to extremely large-scale time series prediction.

To train a time series model with AutoML, first initialize Orca Context:

from bigdl.orca import init_orca_context

# cluster_mode can be "local", "k8s" or "yarn"
init_orca_context(cluster_mode="yarn", cores=4, memory="10g", num_nodes=2, init_ray_on_spark=True)

Then, create a TSDataset for your data:

from bigdl.chronos.data import TSDataset

# df is a pandas DataFrame holding the raw time series (dt_col: datetime column, target_col: value column)
tsdata_train, tsdata_valid, tsdata_test \
        = TSDataset.from_pandas(df,
                                dt_col="dt_col",
                                target_col="target_col",
                                with_split=True,
                                val_ratio=0.1,
                                test_ratio=0.1)

Next, create an AutoTSEstimator.

from bigdl.chronos.autots import AutoTSEstimator

autotsest = AutoTSEstimator(model='lstm')

Finally, call fit on AutoTSEstimator, which applies AutoML to find the best model and hyper-parameters; it returns a TSPipeline which can be used for prediction or evaluation.

# train a pipeline with AutoML support
ts_pipeline = autotsest.fit(data=tsdata_train,
                            validation_data=tsdata_valid)

# predict
ts_pipeline.predict(tsdata_test)
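
The returned TSPipeline can also be evaluated on the hold-out test set; the snippet below is a hedged sketch (the evaluate call and the metric name are assumptions based on the Chronos user guide, so check there for the exact API):

# evaluate on the test set (the metric name here is an assumption)
ts_pipeline.evaluate(tsdata_test, metrics=['mse'])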

See the Chronos user guide and example for more details.

PPML (Privacy Preserving Machine Learning)

BigDL PPML provides a Trusted Cluster Environment for protecting the end-to-end Big Data AI pipeline. It combines various low-level hardware and software security technologies (e.g., Intel SGX, LibOS such as Graphene and Occlum, Federated Learning, etc.), and allows users to run unmodified Big Data analysis and ML/DL programs (such as Apache Spark, Apache Flink, TensorFlow, PyTorch, etc.) in a secure fashion on private or public clouds.

See the PPML user guide for more details.

More information

Citing BigDL

If you've found BigDL useful for your project, you may cite the paper as follows:

@inproceedings{SOCC2019_BIGDL,
  title={BigDL: A Distributed Deep Learning Framework for Big Data},
  author={Dai, Jason (Jinquan) and Wang, Yiheng and Qiu, Xin and Ding, Ding and Zhang, Yao and Wang, Yanzhang and Jia, Xianyan and Zhang, Li (Cherry) and Wan, Yan and Li, Zhichao and Wang, Jiao and Huang, Shengsheng and Wu, Zhongyuan and Wang, Yang and Yang, Yuhao and She, Bowen and Shi, Dongjie and Lu, Qi and Huang, Kai and Song, Guoqiong},
  booktitle={Proceedings of the ACM Symposium on Cloud Computing},
  publisher={Association for Computing Machinery},
  pages={50--60},
  year={2019},
  series={SoCC'19},
  doi={10.1145/3357223.3362707},
  url={https://arxiv.org/pdf/1804.05839.pdf}
}