
castanan / W2v

Licence: MIT
Word2Vec models with Twitter data using Spark. Blog:

Projects that are alternatives of or similar to W2v

Spark Py Notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (+1990.63%)
Mutual labels:  jupyter-notebook, data-science, spark, pyspark
Optimus
🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Stars: ✭ 986 (+1440.63%)
Mutual labels:  jupyter-notebook, data-science, spark, pyspark
Spark Practice
Apache Spark (PySpark) Practice on Real Data
Stars: ✭ 200 (+212.5%)
Mutual labels:  jupyter-notebook, spark, pyspark
Mydatascienceportfolio
Applying Data Science and Machine Learning to Solve Real World Business Problems
Stars: ✭ 227 (+254.69%)
Mutual labels:  jupyter-notebook, data-science, spark
H2o 3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Stars: ✭ 5,656 (+8737.5%)
Mutual labels:  jupyter-notebook, data-science, spark
Handyspark
HandySpark - bringing pandas-like capabilities to Spark dataframes
Stars: ✭ 158 (+146.88%)
Mutual labels:  jupyter-notebook, spark, pyspark
Scalable Data Science Platform
Content for architecting a data science platform for products using Luigi, Spark & Flask.
Stars: ✭ 158 (+146.88%)
Mutual labels:  jupyter-notebook, data-science, spark
Agile data code 2
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
Stars: ✭ 413 (+545.31%)
Mutual labels:  jupyter-notebook, data-science, spark
Spark Tdd Example
A simple Spark TDD example
Stars: ✭ 23 (-64.06%)
Mutual labels:  jupyter-notebook, spark, pyspark
Sparkmagic
Jupyter magics and kernels for working with remote Spark clusters
Stars: ✭ 954 (+1390.63%)
Mutual labels:  jupyter-notebook, spark, pyspark
Data Science Cookbook
🎓 Jupyter notebooks from UFC data science course
Stars: ✭ 60 (-6.25%)
Mutual labels:  jupyter-notebook, data-science, spark
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+134.38%)
Mutual labels:  jupyter-notebook, spark, pyspark
Pyspark Learning
Updated repository
Stars: ✭ 147 (+129.69%)
Mutual labels:  jupyter-notebook, spark, pyspark
Azure Cosmosdb Spark
Apache Spark Connector for Azure Cosmos DB
Stars: ✭ 165 (+157.81%)
Mutual labels:  jupyter-notebook, spark, pyspark
Python Bigdata
Data science and Big Data with Python
Stars: ✭ 112 (+75%)
Mutual labels:  jupyter-notebook, data-science, spark
kafka-compose
🎼 Docker compose files for various kafka stacks
Stars: ✭ 32 (-50%)
Mutual labels:  twitter, spark, pyspark
Pyspark Cheatsheet
🐍 Quick reference guide to common patterns & functions in PySpark.
Stars: ✭ 108 (+68.75%)
Mutual labels:  data-science, spark, pyspark
Pyspark Example Project
Example project implementing best practices for PySpark ETL jobs and applications.
Stars: ✭ 633 (+889.06%)
Mutual labels:  data-science, spark, pyspark
Pixiedust
Python Helper library for Jupyter Notebooks
Stars: ✭ 998 (+1459.38%)
Mutual labels:  jupyter-notebook, data-science, spark
Pysparkgeoanalysis
🌐 Interactive Workshop on GeoAnalysis using PySpark
Stars: ✭ 63 (-1.56%)
Mutual labels:  jupyter-notebook, spark, pyspark

Spark-based machine learning for capturing word meanings

In this repo, you will find out how to build Word2Vec models with Twitter data. For an end-to-end tutorial on how to build models on IBM's Watson Studio, please check this repo.

Pre-reqs: install Python, numpy and Apache Spark

I.) Installing Anaconda installs Python and numpy, among other Python packages. If interested, download it from https://www.continuum.io/downloads

II.) Download and install Apache Spark from http://spark.apache.org/downloads.html

These steps were useful for me when installing Spark 1.5.1 on a Mac: https://github.com/castanan/w2v/blob/master/Install%20Spark%20On%20Mac.txt

III.) A notebook is available here https://github.com/castanan/w2v/blob/master/mllib-scripts/Word2Vec with Twitter Data usign Spark RDDs.ipynb and the good news is that Spark comes with Jupyter + PySpark integrated. If you are in YOUR-SPARK-HOME, you can launch the notebook from the shell by typing the command: IPYTHON_OPTS="notebook" ./bin/pyspark (this works on Spark 1.x; Spark 2.0 and later dropped IPYTHON_OPTS in favor of setting PYSPARK_DRIVER_PYTHON=jupyter and PYSPARK_DRIVER_PYTHON_OPTS=notebook).

Make sure that your pyspark is working

I.) Go to your spark home directory

cd YOUR-SPARK-HOME/bin

II.) Open a pyspark shell by typing the command

./pyspark

or open PySpark with Jupyter by typing the command (you are already inside the bin directory after step I)

IPYTHON_OPTS="notebook" ./pyspark

III.) Print your Spark context by typing sc in the pyspark shell. You should get something like this:

![image of pyspark shell](https://github.com/castanan/w2v/blob/master/images/pyspark-shell.png)
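Printing sc only shows that the context object exists. As a slightly stronger check, the sketch below runs one tiny job through it. The helper name spark_smoke_test is hypothetical and not part of this repo; it only assumes that sc is the SparkContext created by the pyspark shell.

```python
def spark_smoke_test(sc, n=100):
    """Run a tiny job on the given SparkContext and verify the result.

    `sc` is assumed to be the SparkContext that the pyspark shell
    (or the Jupyter notebook started through pyspark) creates for you.
    """
    # Distribute the integers 0..n-1 and add them up with a Spark job.
    total = sc.parallelize(range(n)).sum()
    # Closed-form sum of 0..n-1, computed locally for comparison.
    expected = n * (n - 1) // 2
    return total == expected
```

Paste the function into the pyspark shell and call spark_smoke_test(sc). A return value of True suggests that Spark can schedule and run jobs, not just start the shell.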

Get the Repo

git clone https://github.com/castanan/w2v.git

cd /YOUR-PATH-TO-REPO/w2v

Get the Data

Download (without uncompressing) some tweets from here. The tweets.gz file contains a 10% sample (taken with the Twitter decahose API) of a 15-minute batch of the public tweets from December 23rd. The compressed file is 116 MB (the compression ratio is about 10 to 1).

Note: there is no need to uncompress the file. Just download the tweets.gz file and save it in the repo's data directory, /YOUR-PATH-TO-REPO/w2v/data/.
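Spark can read gzip-compressed text files directly (for example through sc.textFile), which is why the file can stay compressed. To peek at the download without Spark, the following sketch streams it with the Python standard library. It assumes one JSON object per line with the tweet body under a "text" key, which is the usual layout for Twitter streaming data but is not confirmed by this repo. The function name iter_tweet_texts is hypothetical.

```python
import gzip
import json


def iter_tweet_texts(path, limit=None):
    """Yield tweet texts from a gzip-compressed, line-delimited JSON file.

    Assumption: each line is one JSON object and real tweets carry a
    "text" field; other records (such as delete notices) are skipped.
    """
    count = 0
    # gzip.open in text mode decompresses on the fly, line by line.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except ValueError:
                continue  # skip lines that are not valid JSON
            text = record.get("text") if isinstance(record, dict) else None
            if text:
                yield text
                count += 1
                if limit is not None and count >= limit:
                    return
```

For example, list(iter_tweet_texts("data/tweets.gz", limit=5)) would return the first five tweet texts if the file follows the assumed layout.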

There are 2 options to perform the Twitter analysis:

  1. (Suggested) Use DataFrames and Spark ML (March 2016). See ml-scripts/README.md

  2. Use RDDs and Spark MLlib (October 2015). See mllib-scripts/README.md
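To give a rough idea of what the DataFrame route involves, here is a minimal sketch that tokenizes tweet texts and fits a Spark ML Word2Vec model. It is not the code from ml-scripts. The helper names, the tokenization rules, and the default parameter values are assumptions made for illustration, and sql_ctx is assumed to be the SQLContext (or SparkSession) provided by your pyspark shell.

```python
import re


def tokenize(text):
    """Lowercase a tweet, drop URLs, and split it into word tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    # Keep word characters plus a leading # or @ (hashtags and mentions).
    return re.findall(r"[#@]?\w+", text)


def train_word2vec(sql_ctx, texts, vector_size=100, min_count=5):
    """Fit a Spark ML Word2Vec model on an iterable of tweet texts.

    `sql_ctx` is assumed to expose createDataFrame, as SQLContext and
    SparkSession do. The parameter defaults are illustrative only.
    """
    # Imported lazily so that tokenize() can be used without Spark.
    from pyspark.ml.feature import Word2Vec

    # One row per tweet, holding its token list; empty tweets are dropped.
    rows = [(tokens,) for tokens in map(tokenize, texts) if tokens]
    df = sql_ctx.createDataFrame(rows, ["words"])
    w2v = Word2Vec(vectorSize=vector_size, minCount=min_count,
                   inputCol="words", outputCol="vector")
    return w2v.fit(df)
```

Once a model is fitted, model.findSynonyms(word, 10) returns the ten nearest words to a word that is present in the vocabulary. Building the rows on the driver keeps the sketch short, but for the full tweets.gz file you would load and tokenize the data with Spark itself.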
