Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

Data Accelerator for Apache Spark simplifies onboarding to Streaming of Big Data. It offers a rich, easy to use experience to help with creation, editing and management of Spark jobs on Azure HDInsights or Databricks while enabling the full power of the Spark engine.

Stars: ✭ 247 (-73.41%)

Mutual labels: spark, apache-spark, spark-streaming, streaming

Spark States

Custom state store providers for Apache Spark

Stars: ✭ 83 (-91.07%)

Mutual labels: spark, apache-spark, spark-streaming

Splash

Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange

Stars: ✭ 105 (-88.7%)

Mutual labels: spark, bigdata, apache-spark

Sparta

Real Time Analytics and Data Pipelines based on Spark Streaming

Stars: ✭ 513 (-44.78%)

Mutual labels: spark, spark-streaming, streaming

Sparkrdma

RDMA accelerated, high-performance, scalable and efficient ShuffleManager plugin for Apache Spark

Stars: ✭ 215 (-76.86%)

Mutual labels: spark, bigdata, apache-spark

Dpark

Python clone of Spark, a MapReduce alike framework in Python

Stars: ✭ 2,668 (+187.19%)

Mutual labels: spark, bigdata, mapreduce

Streaming Readings

Streaming System 相关的论文读物

Stars: ✭ 554 (-40.37%)

Mutual labels: apache-spark, spark-streaming, streaming

Spark With Python

Fundamentals of Spark with Python (using PySpark), code examples

Stars: ✭ 150 (-83.85%)

Mutual labels: dataframe, spark, apache-spark

Bigdata Notebook

Stars: ✭ 100 (-89.24%)

Mutual labels: spark, bigdata, streaming

Bigdata Notes

大数据入门指南 ⭐

Stars: ✭ 10,991 (+1083.1%)

Mutual labels: spark, bigdata, mapreduce

Whylogs Java

Profile and monitor your ML data pipeline end-to-end

Stars: ✭ 164 (-82.35%)

Mutual labels: dataset, spark, apache-spark

qs-hadoop

大数据生态圈学习

Stars: ✭ 18 (-98.06%)

Mutual labels: bigdata, spark-streaming, mapreduce

Big Data Engineering Coursera Yandex

Big Data for Data Engineers Coursera Specialization from Yandex

Stars: ✭ 71 (-92.36%)

Mutual labels: spark, bigdata, mapreduce

Bigdata Interview

🎯 🌟[大数据面试题]分享自己在网络上收集的大数据相关的面试题以及自己的答案总结.目前包含Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper框架的面试题知识总结

Stars: ✭ 857 (-7.75%)

Mutual labels: spark, bigdata, mapreduce

Real Time Stream Processing Engine

This is an example of real time stream processing using Spark Streaming, Kafka & Elasticsearch.

Stars: ✭ 37 (-96.02%)

Mutual labels: spark, apache-spark, spark-streaming

aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

Stars: ✭ 111 (-88.05%)

Mutual labels: spark, apache-spark, dataframe

leaflet heatmap

简单的可视化湖州通话数据假设数据量很大，没法用浏览器直接绘制热力图，把绘制热力图这一步骤放到线下计算分析。使用Apache Spark并行计算数据之后，再使用Apache Spark绘制热力图，然后用leafletjs加载OpenStreetMap图层和热力图图层，以达到良好的交互效果。现在使用Apache Spark实现绘制，可能是Apache Spark不擅长这方面的计算或者是我没有设计好算法，并行计算的速度比不上单机计算。Apache Spark绘制热力图和计算代码在这 https://github.com/yuanzhaokang/ParallelizeHeatmap.git .

Stars: ✭ 13 (-98.6%)

Mutual labels: spark, apache-spark, bigdata

View All Similar Projects ➔

Mobius development is deprecated and has been superseded by a more recent version '.NET for Apache Spark' from Microsoft (Website | GitHub) that runs on Azure HDInsight Spark, Amazon EMR Spark, Azure & AWS Databricks.

Mobius: C# API for Spark

Mobius provides C# language binding to Apache Spark enabling the implementation of Spark driver program and data processing operations in the languages supported in the .NET framework like C# or F#.

For example, the word count sample in Apache Spark can be implemented in C# as follows :

var lines = sparkContext.TextFile(@"hdfs://path/to/input.txt");  
var words = lines.FlatMap(s => s.Split(' '));
var wordCounts = words.Map(w => new Tuple<string, int>(w.Trim(), 1))  
                      .ReduceByKey((x, y) => x + y);  
var wordCountCollection = wordCounts.Collect();  
wordCounts.SaveAsTextFile(@"hdfs://path/to/wordcount.txt");

A simple DataFrame application using TempTable may look like the following:

var reqDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv");
var metricDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv");
reqDataFrame.RegisterTempTable("requests");
metricDataFrame.RegisterTempTable("metrics");
// C0 - guid in requests DataFrame, C3 - guid in metrics DataFrame  
var joinDataFrame = GetSqlContext().Sql(  
    "SELECT joinedtable.datacenter" +
         ", MAX(joinedtable.latency) maxlatency" +
         ", AVG(joinedtable.latency) avglatency " +
    "FROM (" +
       "SELECT a.C1 as datacenter, b.C6 as latency " +  
       "FROM requests a JOIN metrics b ON a.C0  = b.C3) joinedtable " +   
    "GROUP BY datacenter");
joinDataFrame.ShowSchema();
joinDataFrame.Show();

A simple DataFrame application using DataFrame DSL may look like the following:

// C0 - guid, C1 - datacenter
var reqDataFrame = sqlContext.TextFile(@"hdfs://path/to/requests.csv")  
                             .Select("C0", "C1");    
// C3 - guid, C6 - latency   
var metricDataFrame = sqlContext.TextFile(@"hdfs://path/to/metrics.csv", ",", false, true)
                                .Select("C3", "C6"); //override delimiter, hasHeader & inferSchema
var joinDataFrame = reqDataFrame.Join(metricDataFrame, reqDataFrame["C0"] == metricDataFrame["C3"])
                                .GroupBy("C1");
var maxLatencyByDcDataFrame = joinDataFrame.Agg(new Dictionary<string, string> { { "C6", "max" } });
maxLatencyByDcDataFrame.ShowSchema();
maxLatencyByDcDataFrame.Show();

A simple Spark Streaming application that processes messages from Kafka using C# may be implemented using the following code:

StreamingContext sparkStreamingContext = StreamingContext.GetOrCreate(checkpointPath, () =>
    {
      var ssc = new StreamingContext(sparkContext, slideDurationInMillis);
      ssc.Checkpoint(checkpointPath);
      var stream = KafkaUtils.CreateDirectStream(ssc, topicList, kafkaParams, perTopicPartitionKafkaOffsets);
      //message format: [timestamp],[loglevel],[logmessage]
      var countByLogLevelAndTime = stream
                                    .Map(kvp => Encoding.UTF8.GetString(kvp.Value))
                                    .Filter(line => line.Contains(","))
                                    .Map(line => line.Split(','))
                                    .Map(columns => new Tuple<string, int>(
                                                          string.Format("{0},{1}", columns[0], columns[1]), 1))
                                    .ReduceByKeyAndWindow((x, y) => x + y, (x, y) => x - y,
                                                          windowDurationInSecs, slideDurationInSecs, 3)
                                    .Map(logLevelCountPair => string.Format("{0},{1}",
                                                          logLevelCountPair.Key, logLevelCountPair.Value));
      countByLogLevelAndTime.ForeachRDD(countByLogLevel =>
      {
          foreach (var logCount in countByLogLevel.Collect())
              Console.WriteLine(logCount);
      });
      return ssc;
    });
sparkStreamingContext.Start();
sparkStreamingContext.AwaitTermination();

For more code samples, refer to Mobius\examples directory or Mobius\csharp\Samples directory.

API Documentation

Refer to Mobius C# API documentation for the list of Spark's data processing operations supported in Mobius.

API Usage

Mobius API usage samples are available at:

Examples folder which contains standalone C# and F# projects that can be used as templates to start developing Mobius applications
Samples project which uses a comprehensive set of Mobius APIs to implement samples that are also used for functional validation of APIs
Mobius performance test scenarios implemented in C# and Scala for side by side comparison of Spark driver code

Documents

Refer to the docs folder for design overview and other info on Mobius

Build Status

Ubuntu 14.04.3 LTS	Windows	Unit test coverage

Getting Started

	Windows	Linux
Build & run unit tests	Build in Windows	Build in Linux
Run samples (functional tests) in local mode	Samples in Windows	Samples in Linux
Run examples in local mode	Examples in Windows	Examples in Linux
Run Mobius app	Standalone cluster YARN cluster	Linux cluster Azure HDInsight Spark Cluster AWS EMR Spark Cluster
Run Mobius Shell	Local YARN	Not supported yet

Useful Links

Supported Spark Versions

Mobius is built and tested with Apache Spark 1.4.1, 1.5.2, 1.6.* and 2.0.

Releases

Mobius releases are available at https://github.com/Microsoft/Mobius/releases. References needed to build C# Spark driver applicaiton using Mobius are also available in NuGet

Refer to mobius-release-info.md for the details on versioning policy and the contents of the release.

License

Mobius is licensed under the MIT license. See LICENSE file for full license information.

Community

Mobius project welcomes contributions. To contribute, follow the instructions in CONTRIBUTING.md
Options to ask your question to the Mobius community
- create issue on GitHub
- create post with "sparkclr" tag in Stack Overflow
- join chat at Mobius room in Gitter
- tweet @MobiusForSpark

Code of Conduct

This project has adopted the [email protected] with any additional questions or comments.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].

Stars: ✭ 929

Visit Git Page 🔗Visit User Page 🔗Visit Issues Page (71) 🔗