
DATA-PROCESSING-PIPELINE

Description

Build a powerful Real-Time Data Processing Pipeline & Visualization solution using Docker Machine and Compose, Kafka, Cassandra and Spark in 5 steps.

See the project's architecture below:

Docker Architecture

What's happening under the hood?

We connect to the Twitter streaming API (https://dev.twitter.com/streaming/overview) and listen for events that match a list of keywords; these events are forwarded directly to Kafka, with no parsing. In the middle, a Spark job collects those events and loads them into a Spark SQL context (http://spark.apache.org/sql/), which filters each Kafka message and extracts only the fields of interest, in this case: user.location, text, and user.profile_image_url. Once we have those, we convert the location into coordinates (lat, lng) using the Google Geocoding API (https://developers.google.com/maps/documentation/geocoding/intro) and persist the data into Cassandra.

Finally, a web application fetches the data from Cassandra and renders the tweets of interest on a world map.
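
The Spark job itself is written in Scala, but the per-tweet transformation is easy to sketch. Here is a minimal Python sketch of those two steps (field extraction and geocoding), assuming the requests package; the function names are illustrative and not the project's actual code.

import json
import requests  # assumption: third-party HTTP library, not part of this project

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def extract_fields(raw_tweet):
    # Keep only the fields of interest from the raw Kafka message.
    tweet = json.loads(raw_tweet)
    return {
        "location": tweet["user"].get("location"),
        "text": tweet.get("text"),
        "profile_image_url": tweet["user"].get("profile_image_url"),
    }

def geocode(location, api_key):
    # Resolve a free-text location to (lat, lng) via the Google Geocoding API.
    resp = requests.get(GEOCODE_URL, params={"address": location, "key": api_key})
    results = resp.json().get("results", [])
    if not results:
        return None
    coords = results[0]["geometry"]["location"]
    return (coords["lat"], coords["lng"])

In the real pipeline the filtering happens inside Spark SQL and the result is written to Cassandra rather than returned.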

Project Screenshot

Some Interesting Project Stats

Number of Containers: 8
Number of Open Source Projects Used: 8
Number of Programming Languages Used: 4 (Python, Bash, Scala, Java)

Pre-requisites

Docker (https://docs.docker.com/installation/)

$ wget -qO- https://get.docker.com/ | sh

Docker Machine (https://docs.docker.com/machine/install-machine/)

$ curl -L https://github.com/docker/machine/releases/download/v0.8.2/docker-machine-`uname -s`-`uname -m` > /usr/local/bin/docker-machine
$ chmod +x /usr/local/bin/docker-machine

Docker Compose (https://docs.docker.com/compose/install/)

$ curl -L https://github.com/docker/compose/releases/download/1.8.1/docker-compose-`uname -s`-`uname -m` > /usr/local/bin/docker-compose
$ chmod +x /usr/local/bin/docker-compose

Project Installation / Usage

Step 1: Create a VM with Docker

If you already have a VM running or if you are on Linux, you can skip this step. Otherwise, the steps are the following:

On Digital Ocean

a) Create a Digital Ocean Token

You need to create a personal access token under “Apps & API” in the Digital Ocean Control Panel.

b) Grab your access token, then run docker-machine create with these details:

$ docker-machine create --driver digitalocean --digitalocean-access-token=<access token> --digitalocean-size=2gb Docker-VM

On VirtualBox

You just need to run:

$ docker-machine create -d virtualbox --virtualbox-memory 2048 Docker-VM

On Microsoft Azure

a) Create certificate

$ openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
$ openssl pkcs12 -export -out mycert.pfx -in mycert.pem -name "My Certificate"
$ openssl x509 -inform pem -in mycert.pem -outform der -out mycert.cer

b) Upload the certificate to Microsoft Azure

In the Azure portal, open the "Settings" page (the link is at the bottom of the left sidebar; you may need to scroll), then "Management Certificates", and upload mycert.cer.

c) Grab your subscription ID from the portal (SUBSCRIPTIONS tab), then run docker-machine create with these details:

$ docker-machine create -d azure --azure-subscription-id="SUB_ID" --azure-subscription-cert="mycert.pem" --azure-size="Medium" Docker-VM

d) Expose Port 80

When viewing your VM in the resource group you've created, scroll down and click Endpoints to view the VM's endpoints. Add a new endpoint that exposes port 80 and give it a name.

Access the VM

By default, docker-machine spins up an Ubuntu 14.04 instance on all cloud providers. Because we are running multiple Java-based applications that consume a lot of memory, the docker-machine create commands above include an extra parameter to reserve at least 2GB of memory. The command below will SSH into the VM using your SSH public key:

$ docker-machine ssh Docker-VM

Step 2: Getting Twitter API keys

In order to access the Twitter Streaming API, we need to get 4 pieces of information from Twitter: an API key, an API secret, an access token, and an access token secret. Follow the steps below to get all 4 elements:

Create a Twitter account if you do not already have one.
Go to https://apps.twitter.com/ and log in with your Twitter credentials.
Click "Create New App".
Fill out the form, agree to the terms, and click "Create your Twitter application".
On the next page, click the "API keys" tab and copy your "API key" and "API secret".
Scroll down, click "Create my access token", and copy your "Access token" and "Access token secret".
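
These four credentials are what the Kafka producer uses to open the stream. For reference, here is a minimal Python sketch of a keyword-filtered stream using the tweepy library (3.x API); tweepy is an assumption for illustration, and the project's actual producer may be structured differently.

import tweepy  # assumption: tweepy 3.x, not necessarily what this project uses

# The four credentials gathered in the steps above.
auth = tweepy.OAuthHandler("API_KEY", "API_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

class RawListener(tweepy.StreamListener):
    def on_data(self, raw):
        # The pipeline forwards the raw JSON straight to Kafka, without parsing.
        print(raw)
        return True

stream = tweepy.Stream(auth, RawListener())
stream.filter(track=["python", "scala", "golang"])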

Step 3: Clone this repo and update the docker-compose.yml file (https://docs.docker.com/compose/yml/)

First you need to clone this repo:

$ git clone git@github.com:rogaha/data-processing-pipeline.git

Then, we need to update the Kafka advertised host name, the Twitter API credentials, and the keywords you want to track. Below are the environment variables that need to be updated:

KAFKA_ADVERTISED_HOST_NAME: "" (public IP or the IP of your local VM)
ACCESS_TOKEN: ""
ACCESS_TOKEN_SECRET: ""
CONSUMER_KEY: ""
CONSUMER_SECRET: ""
KEYWORDS_LIST: ""
GOOGLE_GEOCODING_API_KEY: "." (use "." to ignore it)

In order to get the public IP of your Digital Ocean droplet, you can run the following from the VM:

$ /sbin/ifconfig eth0 | grep 'inet addr:' | cut -d: -f2 | awk '{ print $1}'

The KEYWORDS_LIST should be a comma-separated string, such as: "python, scala, golang"
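
For reference, here is a hedged sketch of how a containerized Python process could pick these values up from its environment; the project's own producer may handle them differently.

import os

# Values injected through docker-compose.yml (names from the list above).
consumer_key = os.environ["CONSUMER_KEY"]
consumer_secret = os.environ["CONSUMER_SECRET"]
access_token = os.environ["ACCESS_TOKEN"]
access_token_secret = os.environ["ACCESS_TOKEN_SECRET"]

# KEYWORDS_LIST is a comma-separated string, e.g. "python, scala, golang".
keywords = [k.strip() for k in os.environ["KEYWORDS_LIST"].split(",")]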

Step 4: Start All the Containers

With docker-compose you can just run:

$ docker-compose up -d

The output should be:

Creating dataprocessingpipeline_zookeeper_1...
Creating dataprocessingpipeline_sparkmaster_1...
Creating dataprocessingpipeline_kafka_1...
Creating dataprocessingpipeline_twitterkafkaproducer_1...
Creating dataprocessingpipeline_cassandra_1...
Creating dataprocessingpipeline_sparkjob_1...
Creating dataprocessingpipeline_webserver_1...
Creating dataprocessingpipeline_sparkworker_1...

After that you should wait a few seconds: there is a 15-second delay before the spark-job, Kafka producer, and web containers start, in order to make sure all of their dependencies are up and running.
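
Once everything is up, you can sanity-check that tweets are landing in Cassandra. Below is a minimal Python sketch using the DataStax cassandra-driver package; the contact point, keyspace, and table names are illustrative assumptions, since the actual schema is created by the project's Spark job.

from cassandra.cluster import Cluster  # pip install cassandra-driver

# Hypothetical keyspace/table names; adjust to the schema the Spark job creates.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("tweets_keyspace")

for row in session.execute("SELECT text, lat, lng FROM tweets LIMIT 10"):
    print(row.text, row.lat, row.lng)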

Step 5: Access the IP/Hostname of your machine from your browser

I've cloned this repo, updated the environment variables, and started the containers on Azure.

Open Source Projects Used

Docker (https://github.com/docker/docker)

An open platform for distributed applications for developers and sysadmins

Docker Machine (https://github.com/docker/machine)

Lets you create Docker hosts on your computer, on cloud providers, and inside your own data center

Docker Compose (https://github.com/docker/compose)

Tool for defining and running multi-container applications with Docker

Apache Spark / Spark SQL (https://github.com/apache/spark)

A fast, in-memory data processing engine. Spark SQL lets you query structured data as a resilient distributed dataset (RDD)

Apache Kafka (https://github.com/apache/kafka)

A fast and scalable pub-sub messaging service

Apache ZooKeeper (https://github.com/apache/zookeeper)

A distributed configuration service, synchronization service, and naming registry for large distributed systems

Apache Cassandra (https://github.com/apache/cassandra)

A scalable, highly available, distributed columnar NoSQL database

D3 (https://github.com/mbostock/d3)

A JavaScript visualization library for HTML and SVG.
