
alonsoir / Awesome Recommendation Engine

Licence: apache-2.0
The purpose of this tiny project is to put together the know-how I learned from the Big Data Expert course at formacionhadoop.com. The idea is to show how to play with Apache Spark Streaming, Kafka, MongoDB and Spark machine learning algorithms.

Programming Languages

scala

Projects that are alternatives to or similar to Awesome Recommendation Engine

Szt Bigdata
Shenzhen Metro big data passenger flow analysis system 🚇🚄🌟
Stars: ✭ 826 (+1657.45%)
Mutual labels:  kafka, spark, mongodb
Freestyle
A cohesive & pragmatic framework of FP centric Scala libraries
Stars: ✭ 627 (+1234.04%)
Mutual labels:  kafka, spark
Dev Setup
macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.
Stars: ✭ 5,590 (+11793.62%)
Mutual labels:  spark, mongodb
Goodskill
🐂 A mock flash-sale (seckill) project built on Spring Cloud + Dubbo, with a modular design integrating common open source components such as database/table sharding, Elasticsearch 🔍, gateway, MyBatis-Plus and spring-session
Stars: ✭ 786 (+1572.34%)
Mutual labels:  kafka, mongodb
All Things Cqrs
Comprehensive guide to a couple of possible ways of synchronizing two states with Spring tools. Synchronization is shown by separating command and queries in a simple CQRS application.
Stars: ✭ 474 (+908.51%)
Mutual labels:  kafka, mongodb
Sparta
Real Time Analytics and Data Pipelines based on Spark Streaming
Stars: ✭ 513 (+991.49%)
Mutual labels:  kafka, spark
Stream Reactor
Streaming reference architecture for ETL with Kafka and Kafka-Connect. You can find more on http://lenses.io on how we provide a unified solution to manage your connectors, most advanced SQL engine for Kafka and Kafka Streams, cluster monitoring and alerting, and more.
Stars: ✭ 753 (+1502.13%)
Mutual labels:  kafka, mongodb
Real Time Stock Market Prediction
In this repository, I have developed the entire server-side principal architecture for real-time stock market prediction with Machine Learning. I have used Tensorflow.js for constructing ml model architecture, and Kafka for real-time data streaming and pipelining.
Stars: ✭ 414 (+780.85%)
Mutual labels:  kafka, mongodb
Dockerfiles
50+ DockerHub public images for Docker & Kubernetes - Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak, TeamCity and DevOps tools built on the major Linux distros: Alpine, CentOS, Debian, Fedora, Ubuntu
Stars: ✭ 847 (+1702.13%)
Mutual labels:  kafka, spark
Bigdataguide
Big data learning from scratch, including learning videos for each stage of big data study and interview materials
Stars: ✭ 817 (+1638.3%)
Mutual labels:  kafka, spark
Testcontainers Spring Boot
Container auto-configurations for spring-boot based integration tests
Stars: ✭ 460 (+878.72%)
Mutual labels:  kafka, mongodb
Real Time Stream Processing Engine
This is an example of real time stream processing using Spark Streaming, Kafka & Elasticsearch.
Stars: ✭ 37 (-21.28%)
Mutual labels:  kafka, spark
Bdp Dataplatform
Big data ecosystem data platform: a big data solution covering data platform ingestion, storage, computing, development and application building, based on big data, data platforms, microservices, machine learning, e-commerce, automated operations, DevOps and container deployment.
Stars: ✭ 456 (+870.21%)
Mutual labels:  spark, mongodb
Mongo Spark
The MongoDB Spark Connector
Stars: ✭ 588 (+1151.06%)
Mutual labels:  spark, mongodb
God Of Bigdata
Focused on big data study and interviews; the road to big data mastery starts here. Flink/Spark/Hadoop/Hbase/Hive...
Stars: ✭ 6,008 (+12682.98%)
Mutual labels:  kafka, spark
Kafka Storm Starter
Code examples that show to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.
Stars: ✭ 728 (+1448.94%)
Mutual labels:  kafka, spark
Tutorial
A summary of the Java full-stack knowledge architecture
Stars: ✭ 407 (+765.96%)
Mutual labels:  spark, mongodb
Agile data code 2
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
Stars: ✭ 413 (+778.72%)
Mutual labels:  kafka, spark
Springbootexamples
Spring Boot learning tutorial
Stars: ✭ 794 (+1589.36%)
Mutual labels:  kafka, mongodb
Bigdata Interview
🎯 🌟 [Big data interview questions] Sharing big data interview questions collected online, together with my own answer summaries. Currently covers interview questions for the Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper frameworks
Stars: ✭ 857 (+1723.4%)
Mutual labels:  kafka, spark

Awesome Recommendation Engine

The purpose of this project is to practise the basics of how to code a near-real-time, rating-based recommendation engine. The simple idea is to calculate recommendations for different items using the ratings that other users have given to other products, recalculating as quickly as possible, that is, as soon as a new rating arrives in the system.

These are the components:

  • A Kafka producer periodically asks Amazon for products based on my own ratings and pushes them into a Kafka topic.

  • A Spark Streaming process reads from that topic (a minimal sketch follows).
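
A minimal sketch of that Spark Streaming reader, assuming Spark 1.6.x and the spark-streaming-kafka (Kafka 0.8) connector used by this project; the class name, batch interval and topic handling are illustrative, not the project's actual code:

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  object StreamingSketch {
    def main(args: Array[String]): Unit = {
      // args: broker list (e.g. "127.0.0.1:9092") and topic (e.g. "amazonRatingsTopic")
      val Array(brokers, topic) = args
      val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
      val ssc = new StreamingContext(conf, Seconds(2))
      // Direct (receiver-less) stream against a Kafka 0.8 broker
      val kafkaParams = Map("metadata.broker.list" -> brokers)
      val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set(topic))
      // Each record is a (key, value) pair; the value holds the JSON sent by the producer
      messages.map(_._2).print()
      ssc.start()
      ssc.awaitTermination()
    }
  }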

Steps

  • Apply some machine learning algorithms (ALS, content-based filtering, collaborative filtering) to the datasets read by the Spark Streaming process (see the ALS sketch after this list).

  • Save the results in a MongoDB or Cassandra instance.

  • Use the Play Framework to create a WebSocket interface between the MongoDB instance and the visual front end.
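
For the ALS step mentioned above, here is a minimal sketch with Spark MLlib; the input path, column order and hyperparameters are illustrative assumptions, not the project's actual code:

  import org.apache.spark.SparkContext
  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  // Assumes an existing SparkContext and a CSV of "userId,productId,rating" lines
  def trainAndRecommend(sc: SparkContext, ratingsPath: String) = {
    val ratings = sc.textFile(ratingsPath).map(_.split(',')).map {
      case Array(user, product, rating) => Rating(user.toInt, product.toInt, rating.toDouble)
    }
    // rank = 10 latent factors, 10 iterations, lambda = 0.01 (illustrative values)
    val model = ALS.train(ratings, 10, 10, 0.01)
    // Top 5 product recommendations for user 1
    model.recommendProducts(1, 5)
  }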

I am going to reuse some ideas from previous work:

hello-kafka-twitter-scala

recomendation-spark-engine

Some Preliminaries

At the moment the project can push data to a Kafka topic, and the Spark Streaming process can read data from that topic and save it into a MongoDB instance. Note that I have only had time to test it with Kafka 0.8.1. To use it with later versions, you would have to edit the build.sbt file and change the version of the library. Open a PR if you run into problems and we will look at it.
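
For example, the Kafka-related entries in build.sbt look roughly like this (artifact names and versions are illustrative, check the file for the exact ones):

  // build.sbt (excerpt): the Kafka client version must match the broker you run
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1", // built against Kafka 0.8.2.x
    "org.apache.kafka" %% "kafka" % "0.8.2.1"
  )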

To get Kafka and ZooKeeper up and running, please follow the instructions on this website: https://kafka.apache.org/081/documentation.html#quickstart
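
In short, from the Kafka installation directory (these are the quickstart commands):

  $ bin/zookeeper-server-start.sh config/zookeeper.properties
  $ bin/kafka-server-start.sh config/server.properties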

To get a MongoDB instance up and running, please follow the instructions on this website: https://www.codecademy.com/articles/tdd-setup-mongodb-2

Then you must create a MongoDB database and a collection with the same names that you provide in your src/main/resources/references.conf file. To do that, start the server with the mongod daemon command, open a mongo shell by typing mongo in another terminal, and then run the commands described in the instructions on these websites: https://www.tutorialspoint.com/mongodb/mongodb_create_database.htm https://www.tutorialspoint.com/mongodb/mongodb_create_collection.htm
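
A minimal session, assuming the database and collection names configured in references.conf are alonsodb and amazonRatings (the names used later in this README):

  $ mongod                               # terminal 1: start the server
  $ mongo                                # terminal 2: open a shell
  > use alonsodb
  > db.createCollection("amazonRatings")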

Possible Troubleshooting with mongo

If, like me, you run into trouble starting the daemon again after upgrading MongoDB, I found this Stack Overflow thread useful: https://stackoverflow.com/questions/21448268/how-to-set-mongod-dbpath
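
The thread boils down to pointing mongod at an explicit data directory; the path below is just an example:

  $ mongod --dbpath /usr/local/var/mongodb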

How to build

The project uses sbt with pack support to build the unix-style commands described below.
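
The pack support comes from the sbt-pack plugin, typically enabled in project/plugins.sbt with a line roughly like the following (the version is illustrative, the repository ships its own plugin configuration):

  addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "0.8.2")

Building then looks like this: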

$ sbt clean pack
[info] Loading project definition from /Users/aironman/awesome-recommendation-engine/project
[info] Set current project to my-recommendation-spark-engine (in build file:/Users/aironman/awesome-recommendation-engine/)
[success] Total time: 0 s, completed 06-sep-2018 12:12:40
[info] Updating {file:/Users/aironman/awesome-recommendation-engine/}awesome-recommendation-engine...
[info] Resolving org.scala-lang#scalap;2.10.4 ...
[info] Done updating.
[warn] Scala version was updated by one of library dependencies:
[warn] 	* org.scala-lang:scala-compiler:2.10.0 -> 2.10.2
...
[info] Packaging /Users/aironman/awesome-recommendation-engine/target/scala-2.10/my-recommendation-spark-engine_2.10-1.0-SNAPSHOT.jar ...
[info] Done packaging.
[info] Creating a distributable package in target/pack
...
[info] Create a bin folder: target/pack/bin
[info] Generating launch scripts
[info] main class for twitter-producer: example.producer.TwitterProducer
[info] Generating target/pack/bin/twitter-producer
[info] Generating target/pack/bin/twitter-producer.bat
[info] main class for producer-stream-example: example.producer.ProducerStreamExample
[info] Generating target/pack/bin/producer-stream-example
[info] Generating target/pack/bin/producer-stream-example.bat
[info] main class for amazon-producer-example: example.producer.AmazonProducerExample
[info] Generating target/pack/bin/amazon-producer-example
[info] Generating target/pack/bin/amazon-producer-example.bat
[info] main class for direct-kafka-word-count: example.spark.DirectKafkaWordCount
[info] Generating target/pack/bin/direct-kafka-word-count
[info] Generating target/pack/bin/direct-kafka-word-count.bat
[info] main class for amazon-kafka-connector: example.spark.AmazonKafkaConnector
[info] Generating target/pack/bin/amazon-kafka-connector
[info] Generating target/pack/bin/amazon-kafka-connector.bat
[info] main class for kafka-connector: example.spark.KafkaConnector
[info] Generating target/pack/bin/kafka-connector
[info] Generating target/pack/bin/kafka-connector.bat
[info] packed resource directories = /Users/aironman/awesome-recommendation-engine/src/pack
[info] Generating target/pack/Makefile
[info] Generating target/pack/VERSION
[info] done.
[success] Total time: 61 s, completed 06-sep-2018 12:13:41

After running sbt clean pack in your source folder, you will find the unix-style commands in the target/pack/bin folder.

$ ls
LICENSE			activator.properties	log-cleaner.log		ratings.csv		target
README.md		build.sbt		project			src
$ cd target/
$ ls
pack			resolution-cache	scala-2.10		streams
$ cd pack/
$ ls
Makefile	VERSION		bin		lib
$ ls bin/
amazon-kafka-connector		amazon-producer-example		direct-kafka-word-count		kafka-connector			producer-stream-example		twitter-producer
amazon-kafka-connector.bat	amazon-producer-example.bat	direct-kafka-word-count.bat	kafka-connector.bat		producer-stream-example.bat	twitter-producer.bat

Before running the commands, you will need a Kafka node running with a topic of your choice. The amazon-kafka-connector command below assumes a Kafka broker running on localhost at port 9092 and a topic named amazonRatingsTopic.
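
With Kafka 0.8.1 the topic can be created like this, assuming ZooKeeper on the default localhost:2181:

  $ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic amazonRatingsTopic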

Actual output:

$ ./amazon-kafka-connector 127.0.0.1:9092 amazonRatingsTopic
  Initializing Streaming Spark Context and kafka connector...
  Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
  16/05/16 18:48:49 INFO SparkContext: Running Spark version 1.6.1
  16/05/16 18:48:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  ...
  16/05/16 18:48:51 INFO VerifiableProperties: Verifying properties
  16/05/16 18:48:51 INFO VerifiableProperties: Property group.id is overridden to 
  16/05/16 18:48:51 INFO VerifiableProperties: Property zookeeper.connect is overridden to 
  Initialized Streaming Spark Context and kafka connector...
  Initializing mongodb connector...
  Initialized mongodb connector...
  Creating temporary table in mongo instance...
  16/05/16 18:48:52 INFO SparkContext: Starting job: show at AmazonKafkaConnectorWithMongo.scala:137
  16/05/16 18:48:53 INFO DAGScheduler: Got job 0 (show at AmazonKafkaConnectorWithMongo.scala:137) with 1 output partitions
  ...
  16/05/16 18:48:53 INFO DAGScheduler: Job 0 finished: show at AmazonKafkaConnectorWithMongo.scala:137, took 0,250144 s
  +--------------------+--------------------+
  |                  id|       amazonProduct|
  +--------------------+--------------------+
  |Mon May 16 18:41:...|[  null  , "{\"it...|
  |Mon May 16 18:42:...|[  null  , "{\"it...|
  |Mon May 16 18:45:...|[  null  , "{\"it...|
  +--------------------+--------------------+
16/05/16 18:48:53 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:57536 in memory (size: 2.5 KB, free: 2.4 GB)
  ...
  16/05/16 18:48:57 INFO StreamingContext: Invoking stop(stopGracefully=false) from shutdown hook
  16/05/16 18:48:57 INFO JobGenerator: Stopping JobGenerator immediately
  16/05/16 18:48:57 INFO RecurringTimer: Stopped timer for JobGenerator after time 1463417336000
  16/05/16 18:48:57 INFO JobGenerator: Stopped JobGenerator
  16/05/16 18:48:57 INFO JobScheduler: Stopped JobScheduler
  Finished!

Next, I am going to query Amazon for a product using a productId. Of course, this should be a RESTful GET request or something like that, but I implemented it as a unix command.

$ ./amazon-producer-example 0981531679
Trying to parse product with id 0981531679
amazonProduct is AmazonProduct(0981531679,Scala Puzzlers,http://www.amazon.com/Scala-Puzzlers-Andrew-Phillips/dp/0981531679,http://ecx.images-amazon.com/images/I/41UHeor2AfL._SX218_BO1,204,203,200_QL40_.jpg,)
amazon product sent to kafka cluster...AmazonProduct(0981531679,Scala Puzzlers,http://www.amazon.com/Scala-Puzzlers-Andrew-Phillips/dp/0981531679,http://ecx.images-amazon.com/images/I/41UHeor2AfL._SX218_BO1,204,203,200_QL40_.jpg,)

OK, let's see whether the previous output has been saved in the MongoDB instance...

$ mongo
MongoDB shell version: 3.2.6
connecting to: test
> use alonsodb;
switched to db alonsodb
> db.amazonRatings.find()
{ "_id" : ObjectId("5739f84a8d6ab41037bbf32d"), "id" : ISODate("2016-05-16T16:41:46.183Z"), "amazonProduct" : [ null, "{\"itemId\":\"0981531679\",\"title\":\"Scala Puzzlers\",\"url\":\"http://www.amazon.com/Scala-Puzzlers-Andrew-Phillips/dp/0981531679\",\"img\":\"http://ecx.images-amazon.com/images/I/41UHeor2AfL._SX218_BO1,204,203,200_QL40_.jpg\",\"description\":\"\"}" ] }
{ "_id" : ObjectId("5739f8628d6ab41037bbf32e"), "id" : ISODate("2016-05-16T16:42:10.025Z"), "amazonProduct" : [ null, "{\"itemId\":\"0981531679\",\"title\":\"Scala Puzzlers\",\"url\":\"http://www.amazon.com/Scala-Puzzlers-Andrew-Phillips/dp/0981531679\",\"img\":\"http://ecx.images-amazon.com/images/I/41UHeor2AfL._SX218_BO1,204,203,200_QL40_.jpg\",\"description\":\"\"}" ] }
{ "_id" : ObjectId("5739f9308d6ab41037bbf32f"), "id" : ISODate("2016-05-16T16:45:36.021Z"), "amazonProduct" : [ null, "{\"itemId\":\"0981531679\",\"title\":\"Scala Puzzlers\",\"url\":\"http://www.amazon.com/Scala-Puzzlers-Andrew-Phillips/dp/0981531679\",\"img\":\"http://ecx.images-amazon.com/images/I/41UHeor2AfL._SX218_BO1,204,203,200_QL40_.jpg\",\"description\":\"\"}" ] }
> 

Yeah, I have data in the MongoDB instance. I should check whether data with the same content already exists before inserting, but that was not the most important thing to do, so I left it for future rewrites.

The idea for the project comes from a course about big data technology that I took at formacionhadoop.com. I needed to consolidate and practise Scala, Spark Streaming, Spark ML, Kafka and MongoDB. In the future, I would like to rewrite this project from scratch in a microservices style, with RESTful operations to interact with Amazon, packaging with Docker images, and the latest versions of Kafka, Spark Streaming and Spark ML.

Things to do:

  • Saving the results from the ALS algorithm to the MongoDB instance... DONE!
  • Rewrite the whole project in a microservices / RESTful style
  • Upgrade library versions
  • Docker!
  • Add more machine learning algorithms to improve the recommendations, adding more flows to the machine learning workflow.

Have fun in the process!

Thank you @emecas.
