amir-rahnama / Pyspark Twitter Stream Mining

License: MIT
Real-time Machine Learning with Apache Spark on Twitter Public Stream


Projects that are alternatives of or similar to Pyspark Twitter Stream Mining

Spark Nkp
Natural Korean Processor for Apache Spark
Stars: ✭ 50 (-21.87%)
Mutual labels:  spark
Model Serving Tutorial
Code and presentation for Strata Model Serving tutorial
Stars: ✭ 57 (-10.94%)
Mutual labels:  spark
Waimak
Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
Stars: ✭ 60 (-6.25%)
Mutual labels:  spark
Spark Submit Ui
A Play Framework-based UI for submitting Spark apps
Stars: ✭ 53 (-17.19%)
Mutual labels:  spark
Net.jgp.labs.spark
Apache Spark examples exclusively in Java
Stars: ✭ 55 (-14.06%)
Mutual labels:  spark
Rumble
⛈️ Rumble 1.11.0 "Banyan Tree"🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
Stars: ✭ 58 (-9.37%)
Mutual labels:  spark
Awesome Recommendation Engine
The purpose of this tiny project is to put together the know-how I learned from the Big Data Expert course at formacionhadoop.com. The idea is to show how to play with Apache Spark Streaming, Kafka, MongoDB, and Spark machine learning algorithms.
Stars: ✭ 47 (-26.56%)
Mutual labels:  spark
Spark Doc Zh
Chinese translation of the official Apache Spark documentation
Stars: ✭ 1,126 (+1659.38%)
Mutual labels:  spark
Awesome Pulsar
A curated list of Pulsar tools, integrations and resources.
Stars: ✭ 57 (-10.94%)
Mutual labels:  spark
Data Science Cookbook
🎓 Jupyter notebooks from UFC data science course
Stars: ✭ 60 (-6.25%)
Mutual labels:  spark
Utils4s
A collection of test cases and related materials gathered while working with Scala and Spark
Stars: ✭ 1,070 (+1571.88%)
Mutual labels:  spark
Pulsar Spark
When Apache Pulsar meets Apache Spark
Stars: ✭ 55 (-14.06%)
Mutual labels:  spark
Pyspark Examples
Code examples on Apache Spark using python
Stars: ✭ 58 (-9.37%)
Mutual labels:  spark
Play Spark Scala
Stars: ✭ 51 (-20.31%)
Mutual labels:  spark
Silex
something to help you spark
Stars: ✭ 61 (-4.69%)
Mutual labels:  spark
Apache Spark Internals
The Internals of Apache Spark
Stars: ✭ 1,045 (+1532.81%)
Mutual labels:  spark
Docker Spark Cluster
A Spark cluster setup running on Docker containers
Stars: ✭ 57 (-10.94%)
Mutual labels:  spark
Pysparkgeoanalysis
🌐 Interactive Workshop on GeoAnalysis using PySpark
Stars: ✭ 63 (-1.56%)
Mutual labels:  spark
Roffildlibrary
Library for MQL5 (MetaTrader) with Python, Java, Apache Spark, AWS
Stars: ✭ 63 (-1.56%)
Mutual labels:  spark
Zemberek Nlp Server
A REST Docker server built on the Zemberek Turkish NLP Java library
Stars: ✭ 60 (-6.25%)
Mutual labels:  spark

Real-Time Twitter Mining with Apache Spark (PySpark)

Motivation

I love Python and I love machine learning, especially in real time. As of now, Apache Spark does not ship any Twitter stream integration, so I put together a small workaround to run Spark on Twitter data. Even better, I feed the results into visualizations. So far there is only a D3 word cloud, but I am planning to add more.

Getting Started

  • Install Docker and Docker Compose
  • Install Python and pip
  • Install the dependencies: pip install psutil tweepy websockets
  • Make sure you have Apache Spark installed. This repo works perfectly with spark-1.5.1-bin-hadoop2.6. After that, you just need to remember where you extracted Spark; we call that directory $SPARK_HOME, okay?
  • Get your API keys from [Twitter Developers](https://dev.twitter.com/) and put them in data/config.json.
  • Add a docker entry to your /etc/hosts file that points to your machine.
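The exact key names in data/config.json depend on how twitter_stream.py reads the file, but a typical shape for Twitter OAuth credentials looks like this (field names are illustrative):

```json
{
  "consumer_key": "YOUR_CONSUMER_KEY",
  "consumer_secret": "YOUR_CONSUMER_SECRET",
  "access_token": "YOUR_ACCESS_TOKEN",
  "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET"
}
```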

How to Run the Example?

First, run the Kafka server with the following command:

docker-compose up
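The repo's own docker-compose.yml defines the actual services; as a rough idea, a minimal Kafka setup of that era looks like the sketch below (the image names, ports, and environment variables here are illustrative, not the repo's exact file):

```yaml
zookeeper:
  image: wurstmeister/zookeeper
  ports:
    - "2181:2181"
kafka:
  image: wurstmeister/kafka
  ports:
    - "9092:9092"
  environment:
    KAFKA_ADVERTISED_HOST_NAME: docker   # matches the docker entry in /etc/hosts
    KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
  links:
    - zookeeper
```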

Then, fire up the stream source:

python twitter_stream.py
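twitter_stream.py is the repo's actual implementation; as a hedged sketch of what such a bridge typically does (the config key names, Kafka host, and topic are assumptions, not taken from the repo), assuming tweepy 3.x and the pre-1.0 kafka-python API:

```python
import json


def tweet_text(raw):
    """Pull the text field out of a raw tweet JSON payload (None if absent or invalid)."""
    try:
        return json.loads(raw).get("text")
    except ValueError:
        return None


def run_stream(config, topic="python", track=("python",)):
    # Heavy dependencies are imported here so tweet_text stays usable on its own.
    import tweepy
    from kafka import KafkaClient, SimpleProducer  # kafka-python pre-1.0 API

    producer = SimpleProducer(KafkaClient("docker:9092"))

    class Forwarder(tweepy.StreamListener):
        def on_data(self, data):
            text = tweet_text(data)
            if text:
                # Push each tweet's text onto the Kafka topic for the Spark job.
                producer.send_messages(topic, text.encode("utf-8"))
            return True

    auth = tweepy.OAuthHandler(config["consumer_key"], config["consumer_secret"])
    auth.set_access_token(config["access_token"], config["access_token_secret"])
    tweepy.Stream(auth, Forwarder()).filter(track=list(track))
```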

Now submit the Spark job with spark-submit:

$SPARK_HOME/bin/spark-submit --jars jar/spark-streaming-kafka-assembly_2.10-1.5.1.jar sparkjob.py

You will start to see the most frequently used words in the tweets from your open stream, like this:

-------------------------------------------
Time: 2015-12-18 21:11:17
-------------------------------------------
(u'python', 461)
(u'url', 282)
(u'#python', 125)
(u'user', 102)
(u'como', 70)
(u'de', 59)
(u'con', 43)
(u'monty', 42)
(u'este', 36)
(u'culebra', 35)
...

From then on, you have a stateful count over the real-time Twitter feed of the most used words (stop words and non-alphanumeric words are stripped). The log shows the top 10 words sorted by number of appearances. Note that Spark creates a folder called twitter-checkpoint to keep the application state and the data it needs for failover recovery.
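The real counting happens inside the Spark job with DStream transformations, but the per-batch cleaning and ranking it describes can be sketched in plain Python (the stop-word set here is a tiny illustrative stand-in for whatever list the job actually uses):

```python
import re
from collections import Counter

# Illustrative subset; the real job would use a full stop-word list.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "and", "rt"}


def clean_words(tweet):
    """Lowercase, keep alphanumeric words (plus #hashtags), drop stop words."""
    words = re.findall(r"#?\w+", tweet.lower())
    return [w for w in words if w not in STOP_WORDS]


def top_words(tweets, n=10):
    """Count words across tweets and return the n most frequent, like the job's log output."""
    counts = Counter()
    for tweet in tweets:
        counts.update(clean_words(tweet))
    return counts.most_common(n)
```

In the actual job, the same logic runs over each micro-batch and the counts are folded into state across batches (e.g. with updateStateByKey), which is why the checkpoint folder is needed.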

You should see the most frequent words in tweets that contain "Python". Why Python? Because it's awesome! To track something else, change the query in the stream source, and change the topic in the Kafka example accordingly.

### Real-Time D3.js WordCloud

First, make sure that all the previous steps are running simultaneously. Then:

cd html

bower install

python -m SimpleHTTPServer 9000

Go to http://localhost:9000 to see the running word cloud, which updates every 10 seconds.

### Share & Support

Please help me make this repo a better project by sharing your ideas, forking it, and creating issues for the features you need; I will appreciate any feedback. Send me a tweet at [@_ambodi](https://twitter.com/_ambodi).

### License

See the LICENSE file for license rights and limitations (MIT).
