
douban / Dpark

License: BSD-3-Clause
Python clone of Spark, a MapReduce-like framework in Python

Programming Languages

python
139335 projects - #7 most used programming language
javascript
184084 projects - #8 most used programming language
c
50402 projects - #5 most used programming language
HTML
75241 projects
CSS
56736 projects
shell
77523 projects
Dockerfile
14818 projects

Projects that are alternatives of or similar to Dpark

Mobius
C# and F# language binding and extensions to Apache Spark
Stars: ✭ 929 (-65.18%)
Mutual labels:  spark, bigdata, mapreduce
Bigdata Notes
A beginner's guide to big data ⭐
Stars: ✭ 10,991 (+311.96%)
Mutual labels:  spark, bigdata, mapreduce
Bigdata Interview
🎯 🌟 [Big data interview questions] A collection of big-data interview questions gathered from around the web, together with my own answer summaries. Currently covers the Hadoop/Hive/Spark/Flink/HBase/Kafka/Zookeeper frameworks.
Stars: ✭ 857 (-67.88%)
Mutual labels:  spark, bigdata, mapreduce
data processing course
Some class materials for a data processing course using PySpark
Stars: ✭ 50 (-98.13%)
Mutual labels:  spark, bigdata, stream-processing
Big Data Engineering Coursera Yandex
Big Data for Data Engineers Coursera Specialization from Yandex
Stars: ✭ 71 (-97.34%)
Mutual labels:  spark, bigdata, mapreduce
Bigdata Notebook
Stars: ✭ 100 (-96.25%)
Mutual labels:  spark, bigdata
Splash
Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange
Stars: ✭ 105 (-96.06%)
Mutual labels:  spark, bigdata
Flink Learning
Flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink basics, concepts, principles, hands-on practice, performance tuning, and source-code analysis. Includes study examples for Flink Connectors, Metrics, Libraries, the DataStream API, and the Table API & SQL, plus case studies of large production Flink applications (PV/UV, log storage, real-time deduplication of tens of billions of records, monitoring and alerting). You are welcome to support my column "The Big Data Real-Time Computing Engine Flink in Practice and Performance Optimization".
Stars: ✭ 11,378 (+326.46%)
Mutual labels:  spark, stream-processing
Hadoopcryptoledger
Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive
Stars: ✭ 126 (-95.28%)
Mutual labels:  spark, bigdata
Repository
A personal learning knowledge base covering data warehouse modeling, real-time computing, big data, Java, algorithms, and more.
Stars: ✭ 92 (-96.55%)
Mutual labels:  spark, mapreduce
Awesome Bigdata
A curated list of awesome big data frameworks, resources and other awesomeness.
Stars: ✭ 10,478 (+292.73%)
Mutual labels:  bigdata, stream-processing
Spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
Stars: ✭ 1,721 (-35.49%)
Mutual labels:  spark, bigdata
Logisland
Scalable stream processing platform for advanced realtime analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink being in the roadmap). The platform does complex event processing and is suitable for time series analysis. A large set of valuable ready to use processors, data sources and sinks are available.
Stars: ✭ 97 (-96.36%)
Mutual labels:  spark, stream-processing
Sparktutorial
Source code for James Lee's Apache Spark with Java course
Stars: ✭ 105 (-96.06%)
Mutual labels:  spark, bigdata
Spark Py Notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (-49.85%)
Mutual labels:  spark, bigdata
Lambda Arch
Applying Lambda Architecture with Spark, Kafka, and Cassandra.
Stars: ✭ 111 (-95.84%)
Mutual labels:  spark, bigdata
Hudi
Upserts, Deletes And Incremental Processing on Big Data.
Stars: ✭ 2,586 (-3.07%)
Mutual labels:  bigdata, stream-processing
Azure Event Hubs Spark
Enabling Continuous Data Processing with Apache Spark and Azure Event Hubs
Stars: ✭ 140 (-94.75%)
Mutual labels:  spark, bigdata
Redisgears
Dynamic execution framework for your Redis data
Stars: ✭ 152 (-94.3%)
Mutual labels:  stream-processing, mapreduce
Javaorbigdata Interview
A compilation of interview knowledge points for Java developers and big data developers
Stars: ✭ 203 (-92.39%)
Mutual labels:  spark, bigdata

DPark

Join the chat at https://gitter.im/douban/dpark

DPark is a Python clone of Spark, a MapReduce-like computing framework that supports iterative computation.

Installation

## Due to the use of C extensions, some libraries need to be installed first.

$ sudo apt-get install libtool pkg-config build-essential autoconf automake
$ sudo apt-get install python-dev
$ sudo apt-get install libzmq-dev

## Then just pip install dpark (``sudo`` may be needed if you encounter a permission problem).

$ pip install dpark
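
As a quick sanity check after installing, you can run a minimal local job (a sketch that uses the makeRDD API shown later in this README):

from dpark import DparkContext

dc = DparkContext()
# Build a small in-memory RDD, square each element, and sum the results.
print(dc.makeRDD(list(range(10))).map(lambda x: x * x).reduce(lambda x, y: x + y))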

Example

For word counting (wc.py):

from dpark import DparkContext

ctx = DparkContext()
file = ctx.textFile("/tmp/words.txt")
# Split each line into words and pair each word with a count of 1.
words = file.flatMap(lambda x: x.split()).map(lambda x: (x, 1))
# Sum the counts per word and collect the result as a dict.
wc = words.reduceByKey(lambda x, y: x + y).collectAsMap()
print(wc)

This script can run locally or on a Mesos cluster without any modification, just by passing different command-line arguments:

$ python wc.py
$ python wc.py -m process
$ python wc.py -m host[:port]

See examples/ for more use cases.
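
As another small illustration (a sketch, not taken from examples/, assuming the Spark-style filter and count operations that DPark mirrors), counting matching lines in the same file:

from dpark import DparkContext

ctx = DparkContext()
lines = ctx.textFile("/tmp/words.txt")
# Keep only the lines that contain the word "spark" and count them.
print(lines.filter(lambda line: "spark" in line).count())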

Configuration

DPark can run with Mesos 0.9 or higher.

If the $MESOS_MASTER environment variable is set, you can use a shortcut and run DPark on Mesos just by typing

$ python wc.py -m mesos

$MESOS_MASTER can be any Mesos master address, such as

$ export MESOS_MASTER=zk://zk1:2181,zk2:2181,zk3:2181/mesos_master

In order to speed up shuffling, you should deploy Nginx on port 5055 to serve the data in DPARK_WORK_DIR (default: /tmp/dpark), for example:

server {
        listen 5055;
        server_name localhost;
        root /tmp/dpark/;
}

UI

The web UI shows two DAGs:

  1. the stage graph: a stage is a running unit that contains a set of tasks, each of which runs the same operations on one split of an RDD (see the sketch after this list).
  2. the user API call-site graph.
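
For intuition, in a small job like the one below (a minimal sketch; the exact stage split is decided by DPark's scheduler), the groupByKey shuffle typically marks the boundary between two stages:

from dpark import DparkContext

dc = DparkContext()
rdd = dc.makeRDD([("a", 1), ("b", 2), ("a", 3)])
# map runs in the first (shuffle map) stage; groupByKey introduces a shuffle,
# and collect runs in a second stage that reads the shuffled output.
print(rdd.map(lambda kv: (kv[0], kv[1] * 10)).groupByKey().collect())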

UI when running

Just open the URL printed in the log, e.g. start listening on Web UI http://server_01:40812 .

UI after running

  1. Before running, configure LOGHUB & LOGHUB_PATH_FORMAT in dpark.conf and pre-create LOGHUB_DIR.
  2. Get the log hub directory from a log line such as logging/prof to LOGHUB_DIR/2018/09/27/16/b2e3349b-9858-4153-b491-80699c757485-8754, which includes the Mesos framework id.
  3. Run dpark_web.py -p 9999 -l LOGHUB_DIR/2018/09/27/16/b2e3349b-9858-4153-b491-80699c757485-8728/ (dpark_web.py is in tools/), as shown below.
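
Put together, and assuming the web server then listens on the port given with -p (a sketch; the log hub path is the example directory from the steps above):

$ python tools/dpark_web.py -p 9999 -l LOGHUB_DIR/2018/09/27/16/b2e3349b-9858-4153-b491-80699c757485-8728/
## then open http://localhost:9999 in a browser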

UI examples for features

Showing how shuffle map output is shared:

from dpark import DparkContext
m = lambda x: x  # placeholder map function for illustration
rdd = DparkContext().makeRDD([(1, 1)]).map(m).groupByKey()
# Both collect() calls reuse the shuffle map output of the same rdd.
rdd.map(m).collect()
rdd.map(m).collect()

images/share_mapoutput.png

Nodes are combined only if they have the same lineage, forming a logical tree inside a stage; each node then contains a pipeline of RDDs:

# Assuming dc is a DparkContext and get_rdd() returns some RDD (e.g. from dc.textFile()).
rdd1 = get_rdd()
rdd2 = dc.union([get_rdd() for i in range(2)])
rdd3 = get_rdd().groupByKey()
dc.union([rdd1, rdd2, rdd3]).collect()

images/unions.png

More docs (in Chinese)

https://dpark.readthedocs.io/zh_CN/latest/

https://github.com/jackfengji/test_pro/wiki

Mailing list: dpark-users@googlegroups.com (http://groups.google.com/group/dpark-users)
