
pandening / storm-ml

Licence: Apache-2.0 license
an online learning algorithm library for Storm

Programming Languages

java
shell


open-streamer

What is open-streamer?

Open-Streamer is a library built on the Storm platform and written against the Trident API. It focuses on real-time and online learning algorithms, and it implements several classical algorithm families, including classification, clustering, regression, cardinality estimation, and average computation. You can use it to build smart applications in a big-data environment, and it is easy to add to your project. The tutorial below will help you get started. This library is not entirely original: open-streamer extends Trident-ML, a machine learning library for Storm. Thanks to Trident-ML for its open source spirit.

Open-Streamer algorithm overview:

  • Average
    • Moving Average[1]
    • EWMA average[2]
  • Cardinality
    • LogLog Cardinality[3]
    • HyperLogLog cardinality[4]
    • Adaptive Counting Cardinality[5]
    • Linear Counting
  • Classification
    • Committee Classifier[6]
    • Passive Aggressive Classifier[7]
    • Perceptron Classifier[8]
    • Winnow Classifier[9]
    • Balanced Winnow Classifier[10]
    • Modified Balanced Winnow Classifier[11]
  • Clustering
    • Birch
    • Canopy
    • K-means
  • Frequency Counting
    • Count Sketch[12]
    • Lossy Counting[13]
    • Sticky Sampling Counting[14]
    • Space Saving[15]
    • Top-k
  • Regression
    • FTRL Regression[16]
    • Perceptron Regression[17]
    • Passive Aggressive Regression
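
To make one entry in the list above concrete, here is a minimal standalone sketch of the Space Saving idea from reference [15]. It is not the library's implementation: the class name SpaceSavingSketch, its methods, and the use of String items are assumptions for illustration, and the per-counter error bookkeeping described in the paper is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class, not part of open-streamer.
final class SpaceSavingSketch {
    private final int capacity;                       // maximum number of monitored items
    private final Map<String, Long> counts = new HashMap<>();

    SpaceSavingSketch(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Processes one item from the stream. */
    void offer(String item) {
        Long current = counts.get(item);
        if (current != null) {
            counts.put(item, current + 1);            // already monitored: increment
        } else if (counts.size() < capacity) {
            counts.put(item, 1L);                     // free slot: start monitoring
        } else {
            // No free slot: replace the item with the smallest count and
            // let the newcomer inherit that count plus one.
            String minKey = null;
            long minCount = Long.MAX_VALUE;
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                if (e.getValue() < minCount) {
                    minCount = e.getValue();
                    minKey = e.getKey();
                }
            }
            counts.remove(minKey);
            counts.put(item, minCount + 1);
        }
    }

    /** Estimated count: never below the true count for monitored items, 0 if unmonitored. */
    long estimate(String item) {
        return counts.getOrDefault(item, 0L);
    }

    public static void main(String[] args) {
        SpaceSavingSketch sketch = new SpaceSavingSketch(2);
        for (String s : new String[] {"a", "b", "a", "c", "a"}) {
            sketch.offer(s);
        }
        System.out.println("a ~ " + sketch.estimate("a"));
        System.out.println("c ~ " + sketch.estimate("c"));
    }
}
```

Memory stays bounded by the capacity, at the cost of overestimating items that entered by replacing another one. The linear scan for the minimum keeps the sketch short; a real implementation would likely use a structure with faster minimum lookup.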

Tutorial

Your topology (a DAG) needs a spout (see https://github.com/pandening/open-streamer/blob/master/src/main/java/com/hujian/trident/ml/examples/data/DoubleSpout.java for a reference implementation). The data flowing from the spout must then be transformed into instances of com.hujian.trident.ml.core.Instance. The library provides a simple instance creator, com.hujian.trident.ml.core.InstanceCreator, which you can use to create an instance and emit it downstream. You should be familiar with the Trident API (Function, Filter, StateUpdater, and so on). For example, to filter the data flow, pass it through a Trident Filter and emit only the data you want downstream.

You can build your topology with the Trident API. For example, to build a topology that runs an averaging algorithm such as Moving Average, you only need to adjust the runtime parameters. The following Java code shows how to use this library.

Here average is an instance of IAverage; you can set it to a new MovingAverage or EWMAAverage.

        // `average` is an IAverage instance (MovingAverage or EWMAAverage) created beforehand.
        TridentTopology tridentTopology = new TridentTopology();

        tridentTopology.newStream(topologyName,new DoubleSpout(10))
                // Turn each raw tuple from the spout into an "instance" field.
                .each(new Fields("item","frequency","type"),
                        new CountEntryInstanceCreator<Double>(),new Fields("instance"))
                // Update the average model kept in Trident state.
                .partitionPersist(new MemoryMapState.Factory(),new Fields("instance"),
                        new AverageModelUpdater("average-model-update",average),new Fields("average"))
                // Stream the updated values out of the state and show them.
                .newValuesStream()
                .each(new Fields("average"),new ShowAverageFunction(),new Fields("done"))
                .each(new Fields("done"),new ShowAverageFunction(),new Fields(""));

A complete example is available at https://github.com/pandening/open-streamer/blob/master/src/main/java/com/hujian/trident/ml/GPAPPBuilder.java
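
For readers unfamiliar with the two averages named in this tutorial, the following standalone sketch shows what a moving average and an EWMA compute. It does not use the library: the class names SimpleMovingAverage and ExponentialAverage, the window size, and the smoothing factor alpha are assumptions for illustration, and the library's own MovingAverage and EWMAAverage may differ in detail.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration classes, not part of open-streamer.
final class AverageSketch {

    /** Mean of the most recent `window` values. */
    static final class SimpleMovingAverage {
        private final int window;
        private final Deque<Double> values = new ArrayDeque<>();
        private double sum;

        SimpleMovingAverage(int window) {
            if (window <= 0) {
                throw new IllegalArgumentException("window must be positive");
            }
            this.window = window;
        }

        double update(double x) {
            values.addLast(x);
            sum += x;
            if (values.size() > window) {
                sum -= values.removeFirst();          // drop the oldest value
            }
            return sum / values.size();
        }
    }

    /** Exponentially weighted moving average: new = alpha * x + (1 - alpha) * old. */
    static final class ExponentialAverage {
        private final double alpha;
        private double value;
        private boolean initialized;

        ExponentialAverage(double alpha) {
            if (alpha <= 0.0 || alpha > 1.0) {
                throw new IllegalArgumentException("alpha must be in (0, 1]");
            }
            this.alpha = alpha;
        }

        double update(double x) {
            if (!initialized) {
                value = x;                            // seed with the first observation
                initialized = true;
            } else {
                value = alpha * x + (1.0 - alpha) * value;
            }
            return value;
        }
    }

    public static void main(String[] args) {
        SimpleMovingAverage sma = new SimpleMovingAverage(2);
        ExponentialAverage ewma = new ExponentialAverage(0.5);
        for (double x : new double[] {1.0, 3.0, 5.0}) {
            System.out.println("sma=" + sma.update(x) + " ewma=" + ewma.update(x));
        }
    }
}
```

The moving average needs memory proportional to the window, while the EWMA keeps a single number, which is one reason it is popular in streaming settings.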

A complex demo for this library

The Hybrid Classifier is a more complex demo for this library. You can add arbitrary classifiers to the factory, and the factory chooses some of them to classify each instance. The actual demo uses four classifiers: a Committee Classifier and three Passive Aggressive classifiers (PA, PA-I, PA-II).

The data flow is classified by the Committee Classifier first. The classification result is stored in a singleton class; you can provide your own storage by implementing IStore. The data then continues downstream to the PA classifier, which classifies the instance and then fetches the list of classification results for that instance by instance ID (each instance is assigned an instance ID).

If the Committee's result equals PA's result, classification ends: the result is returned, the instance is removed from storage, and the flow goes to a Trident Function named EndFunction, which prints some output (you can do more complex work there). If the Committee's result differs from PA's result, the data continues to the next classifier, PA-I, which works the same way as PA, and, if necessary, on to PA-II. If, after PA-II, the instance's result list still contains no matching results, the program votes for one classifier's result according to a weight vector.

The classifiers maintain this weight vector. Whenever any classifier obtains the classification result, the weight vector is updated by these rules:

  • (1) Scan each classifier's classification result; if a classifier's result is the final result, add 1 to that classifier's weight.
  • (2) After updating the weight vector, normalize its sum to 100 (or another small value).

The final classifier also gathers some statistics, such as right/error counts, which you can print to watch the algorithm run.
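
The voting and weight-update rules described above can be sketched in plain Java. This is an illustration under stated assumptions, not the demo's actual code: the class HybridVoteSketch and its methods are hypothetical, labels are plain integers, and the vote is assumed to pick the label with the largest total weight.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class, not part of open-streamer.
final class HybridVoteSketch {

    /** Returns the label with the largest total weight (the first label seen wins ties). */
    static int vote(int[] predictions, double[] weights) {
        Map<Integer, Double> totals = new LinkedHashMap<>();
        for (int i = 0; i < predictions.length; i++) {
            totals.merge(predictions[i], weights[i], Double::sum);
        }
        int best = predictions[0];
        double bestWeight = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Integer, Double> e : totals.entrySet()) {
            if (e.getValue() > bestWeight) {
                bestWeight = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    /** Rule (1): add 1 to the weight of every classifier that predicted the final label. */
    static void reward(int[] predictions, int finalLabel, double[] weights) {
        for (int i = 0; i < predictions.length; i++) {
            if (predictions[i] == finalLabel) {
                weights[i] += 1.0;
            }
        }
    }

    /** Rule (2): rescale the weights so that they sum to targetSum (100 in the text). */
    static void normalize(double[] weights, double targetSum) {
        double sum = 0.0;
        for (double w : weights) {
            sum += w;
        }
        if (sum <= 0.0) {
            return;                                   // nothing sensible to rescale
        }
        for (int i = 0; i < weights.length; i++) {
            weights[i] = weights[i] * targetSum / sum;
        }
    }

    public static void main(String[] args) {
        // One prediction per classifier, e.g. Committee, PA, PA-I, PA-II.
        int[] predictions = {1, 2, 3, 2};
        double[] weights = {40, 20, 10, 30};
        int finalLabel = vote(predictions, weights);
        reward(predictions, finalLabel, weights);
        normalize(weights, 100.0);
        System.out.println("final label = " + finalLabel);
        System.out.println("weights = " + Arrays.toString(weights));
    }
}
```

Normalizing after every update keeps the weights bounded, so a classifier that was right early on cannot dominate the vote indefinitely through sheer accumulated count.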

Relevant Knowledge

  • Storm [18]
  • Trident [19]
  • Trident-ml [20]
  • Mahout [21]

Authors

Jian Hu, Nankai University, Tianjin, China, 2013.9 - 2017.6 (Computer Science and Technology)

Email:[email protected]

Copyright and license

Copyright 2013-2017 Hu Jian

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Links & References

[1] Search keywords: Moving Average (Google)

[2] http://blog.csdn.net/x_i_y_u_e/article/details/44194761

[3] http://blog.csdn.net/keshixi/article/details/46730231

[4] http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

[5] Fast and Accurate Traffic Matrix Measurement Using Adaptive Cardinality Counting

[6] A Multi-class Linear Learning Algorithm Related to Winnow

[7] Online Passive-Aggressive Algorithms

[8] http://www.cnblogs.com/jerrylead/archive/2011/04/18/2020173.html

[9] https://en.wikipedia.org/wiki/Winnow_(algorithm)

[10] Single-Pass Online Learning: Performance, Voting Schemes and Online Feature Selection

[11] Gender Identification on Twitter Using the Modified Balanced Winnow

[12] http://dimacs.rutgers.edu/~graham/pubs/papers/freqvldbj.pdf

[13] Approximate Frequency Counts over Data Streams

[14] Approximate Frequency Counts over Data Streams

[15] Efficient Computation of Frequent and Top-k Elements in Data Streams

[16] Ad Click Prediction: a View from the Trenches

[17] Online Passive-Aggressive Algorithms

[18] http://storm.apache.org/

[19] https://github.com/apache/storm/tree/master/storm-core/src/jvm/org/apache/storm/trident

[20] https://github.com/pmerienne/trident-ml

[21] http://mahout.apache.org/
