vinta / Albedo

License: MIT
A recommender system for discovering GitHub repos, built with Apache Spark

Programming Languages

python
139335 projects - #7 most used programming language
scala
5932 projects

Projects that are alternatives to or similar to Albedo

Alink
Alink is the Machine Learning algorithm platform based on Flink, developed by the PAI team of Alibaba computing platform.
Stars: ✭ 2,936 (+1870.47%)
Mutual labels:  recommender-system, feature-engineering
Elastic Graph Recommender
Building recommenders with Elastic Graph!
Stars: ✭ 33 (-77.85%)
Mutual labels:  elasticsearch, recommender-system
Remixautoml
R package for automation of machine learning, forecasting, feature engineering, model evaluation, model interpretation, data generation, and recommenders.
Stars: ✭ 159 (+6.71%)
Mutual labels:  recommender-system, feature-engineering
Real Time Stream Processing Engine
This is an example of real time stream processing using Spark Streaming, Kafka & Elasticsearch.
Stars: ✭ 37 (-75.17%)
Mutual labels:  apache-spark, elasticsearch
NVTabular
NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Stars: ✭ 797 (+434.9%)
Mutual labels:  recommender-system, feature-engineering
Griffon Vm
Griffon Data Science Virtual Machine
Stars: ✭ 128 (-14.09%)
Mutual labels:  apache-spark, elasticsearch
Oryx
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
Stars: ✭ 1,785 (+1097.99%)
Mutual labels:  apache-spark
Eventflow
Async/await first CQRS+ES and DDD framework for .NET
Stars: ✭ 1,932 (+1196.64%)
Mutual labels:  elasticsearch
Sns Forum Website
Nowcoder advanced project (SNS + community Q&A website)
Stars: ✭ 143 (-4.03%)
Mutual labels:  elasticsearch
Scalable Data Science
Scalable Data Science, course sets in big data Using Apache Spark over databricks and their mathematical, statistical and computational foundations using SageMath.
Stars: ✭ 142 (-4.7%)
Mutual labels:  apache-spark
Ncf
A pytorch implementation of He et al. "Neural Collaborative Filtering" at WWW'17
Stars: ✭ 149 (+0%)
Mutual labels:  recommender-system
Evalml
EvalML is an AutoML library written in python.
Stars: ✭ 145 (-2.68%)
Mutual labels:  feature-engineering
Sofang
A real-estate search site (Soufang) built with Spring Boot and Elasticsearch
Stars: ✭ 146 (-2.01%)
Mutual labels:  elasticsearch
Json Logging Python
Python logging library to emit JSON log that can be easily indexed and searchable by logging infrastructure such as ELK, EFK, AWS Cloudwatch, GCP Stackdriver
Stars: ✭ 143 (-4.03%)
Mutual labels:  elasticsearch
Parquetviewer
Simple windows desktop application for viewing & querying Apache Parquet files
Stars: ✭ 145 (-2.68%)
Mutual labels:  apache-spark
Hydrograph
A visual ETL development and debugging tool for big data
Stars: ✭ 144 (-3.36%)
Mutual labels:  apache-spark
Elasticgeo
ElasticGeo provides a GeoTools data store that allows geospatial features from an Elasticsearch index to be published via OGC services using GeoServer.
Stars: ✭ 148 (-0.67%)
Mutual labels:  elasticsearch
Caiss
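A cross-platform, multi-language, high-performance retrieval engine for similar vectors, similar words, and similar sentences. Powerful and easy to use. Stars and forks are welcome. Build together! Power another!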
Stars: ✭ 142 (-4.7%)
Mutual labels:  recommender-system
Rival
RiVal recommender system evaluation toolkit
Stars: ✭ 145 (-2.68%)
Mutual labels:  recommender-system
Indigo
Universal cheminformatics libraries, utilities and database search tools
Stars: ✭ 146 (-2.01%)
Mutual labels:  elasticsearch

Albedo

A recommender system for discovering GitHub repos, built with Apache Spark.

Albedo is a fictional character in Dan Simmons's Hyperion Cantos series. Councilor Albedo is the TechnoCore's AI advisor to the Hegemony of Man.

Setup

$ git clone https://github.com/vinta/albedo.git
$ cd albedo
$ make up

Collect Data

Create a personal access token on your GitHub settings page. The commands below refer to it as GITHUB_PERSONAL_TOKEN.

# get into the main container
$ make attach

# this step might take a few hours to complete,
# depending on how many repos you starred and how many users you followed
$ (container) python manage.py migrate
$ (container) python manage.py collect_data -t GITHUB_PERSONAL_TOKEN -u GITHUB_USERNAME
# or, instead of collecting the data yourself, load the provided SQL file
$ (container) wget https://s3-ap-northeast-1.amazonaws.com/files.albedo.one/albedo.sql
$ (container) mysql -h mysql -u root -p123 albedo < albedo.sql

# credentials for the admin site opened below
# username: albedo
# password: hyperion
$ make run
$ open http://127.0.0.1:8000/admin/
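
The internals of collect_data are not shown in this README. As a rough sketch of the kind of call such a command could make, the snippet below builds a request for one page of a user's starred repos against the public GitHub REST API. The helper names build_starred_request and fetch_starred_page are hypothetical and are not taken from Albedo.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_starred_request(username, token, page=1, per_page=100):
    """Build a request for one page of the repos that `username` has starred."""
    url = f"{GITHUB_API}/users/{username}/starred?per_page={per_page}&page={page}"
    headers = {
        # A personal access token raises the API rate limit.
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github+json",
    }
    return urllib.request.Request(url, headers=headers)


def fetch_starred_page(username, token, page=1):
    """Fetch and decode one page. This performs a real HTTP call when invoked."""
    request = build_starred_request(username, token, page)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_starred_page with increasing page numbers until it returns an empty list would walk through all starred repos. A real collector would also need to handle rate limiting and retries.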

Start a Spark Cluster

You could also create a Spark cluster on Google Cloud Dataproc.

# start a local Spark cluster in Standalone mode
$ make spark_start

Use Popularity as the Recommendation Baseline

See PopularityRecommenderBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.PopularityRecommenderBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.002017744675282716
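
To make the idea of a popularity baseline concrete, here is a minimal pure-Python sketch. It is not the Spark job from PopularityRecommenderBuilder.scala. It simply assumes that "popularity" means the number of users who starred a repo, and recommends the most-starred repos the user has not starred yet. The function name and the toy data are made up for illustration.

```python
from collections import Counter


def recommend_popular(starrings, user, k=3):
    """Return the k most-starred repos that `user` has not starred yet.

    `starrings` is an iterable of (user, repo) pairs.
    """
    starrings = list(starrings)
    star_counts = Counter(repo for _, repo in starrings)
    already_starred = {repo for u, repo in starrings if u == user}
    # Sort by descending star count, then by name so that ties are deterministic.
    ranked = sorted(star_counts.items(), key=lambda item: (-item[1], item[0]))
    return [repo for repo, _ in ranked if repo not in already_starred][:k]


starrings = [
    ("alice", "spark"), ("bob", "spark"), ("carol", "spark"),
    ("alice", "flink"), ("bob", "flink"),
    ("carol", "kafka"),
    ("dave", "oryx"),
]
print(recommend_popular(starrings, "carol", k=2))  # ['flink', 'oryx']
```

Every user gets nearly the same list, which is why a baseline like this is only useful as a floor for the personalised models to beat.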

Build the User Profile for Feature Engineering

See UserProfileBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.UserProfileBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar
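
The actual features live in UserProfileBuilder.scala and are not listed here. As an illustration only, a user profile could aggregate a user's starred repos into a few flat features, as in the sketch below. The feature names and the input keys "language" and "stargazers_count" are assumptions, not Albedo's schema.

```python
from collections import Counter


def build_user_profile(user, starred_repos):
    """Summarise a user's starred repos as a flat feature dict.

    `starred_repos` is a list of dicts with the hypothetical keys
    "language" and "stargazers_count".
    """
    languages = Counter(r["language"] for r in starred_repos if r.get("language"))
    total = len(starred_repos)
    avg_stars = (
        sum(r["stargazers_count"] for r in starred_repos) / total if total else 0.0
    )
    top = languages.most_common(1)
    return {
        "user": user,
        "starred_count": total,
        "top_language": top[0][0] if top else None,
        "avg_repo_stars": avg_stars,
    }


repos = [
    {"language": "Python", "stargazers_count": 100},
    {"language": "Python", "stargazers_count": 50},
    {"language": "Scala", "stargazers_count": 30},
]
print(build_user_profile("vinta", repos))
```

An item profile can be built the same way in the other direction, by aggregating over the users who starred each repo.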

Build the Item Profile for Feature Engineering

See RepoProfileBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.RepoProfileBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar

Train an ALS Model for Candidate Generation

See ALSRecommenderBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.ALSRecommenderBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.05209047292612741
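
For readers new to ALS, the sketch below shows the alternating idea in its simplest form: rank-1 factors, explicit values, and a closed-form ridge update for one side while the other is held fixed. This is a teaching toy under those assumptions. It is not Spark's ALS implementation and not the code in ALSRecommenderBuilder.scala, which may use implicit feedback and a much higher rank.

```python
def als_rank1(ratings, n_users, n_items, reg=0.1, iterations=10):
    """Rank-1 alternating least squares on (user, item, value) triples."""
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iterations):
        # Hold item factors fixed; each user factor has a closed-form ridge solution.
        for u in range(n_users):
            obs = [(i, r) for (uu, i, r) in ratings if uu == u]
            num = sum(r * item_f[i] for i, r in obs)
            den = reg + sum(item_f[i] ** 2 for i, _ in obs)
            user_f[u] = num / den
        # Hold user factors fixed; solve for each item factor the same way.
        for i in range(n_items):
            obs = [(u, r) for (u, ii, r) in ratings if ii == i]
            num = sum(r * user_f[u] for u, r in obs)
            den = reg + sum(user_f[u] ** 2 for u, _ in obs)
            item_f[i] = num / den
    return user_f, item_f


def objective(ratings, user_f, item_f, reg):
    """Regularised squared error that each ALS half-step minimises."""
    fit = sum((r - user_f[u] * item_f[i]) ** 2 for u, i, r in ratings)
    penalty = reg * (sum(x * x for x in user_f) + sum(x * x for x in item_f))
    return fit + penalty


ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 2.0), (2, 1, 1.0)]
user_f, item_f = als_rank1(ratings, n_users=3, n_items=2)
print(objective(ratings, user_f, item_f, reg=0.1))
```

A predicted score for a (user, item) pair is the product of the two factors. Candidate generation then means taking the highest-scoring unseen items for each user.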

Build a Content-based Recommender for Candidate Generation

Elasticsearch's More Like This query will do the trick.

$ (container) python manage.py sync_data_to_es

See ContentRecommenderBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,org.apache.httpcomponents:httpclient:4.5.2,org.elasticsearch.client:elasticsearch-rest-high-level-client:5.6.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.ContentRecommenderBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.002559563451967487
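
As a hedged illustration of the More Like This approach mentioned in this section, the sketch below builds a request body for Elasticsearch's more_like_this query that asks for documents similar to one already-indexed repo. The index name, field names, and the helper build_mlt_query are hypothetical. The query Albedo actually sends is in ContentRecommenderBuilder.scala.

```python
import json


def build_mlt_query(index, doc_id, fields, doc_type=None, size=30):
    """Build a search body that finds documents similar to an indexed one."""
    like_doc = {"_index": index, "_id": doc_id}
    if doc_type is not None:
        # Older Elasticsearch versions (such as 5.x) also identify documents by type.
        like_doc["_type"] = doc_type
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": [like_doc],
                "min_term_freq": 1,
                "max_query_terms": 25,
            }
        },
    }


body = build_mlt_query("repo", "123", ["description", "topics"], doc_type="repo_info")
print(json.dumps(body, indent=2))
```

Running one such query per repo a user has starred, then merging the hits, is one simple way to turn similarity search into a candidate list.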

Train a Word2Vec Model for Text Vectorization

See Word2VecCorpusBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,com.hankcs:hanlp:portable-1.3.4,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.Word2VecCorpusBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar

Train a Logistic Regression Model for Ranking

See LogisticRegressionRanker.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,com.hankcs:hanlp:portable-1.3.4,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.LogisticRegressionRanker \
    target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.021114356461615493
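
The score printed after several of the jobs above is a top-k ranking metric. Assuming it is NDCG@k with binary relevance (this README does not spell out the definition, so treat that as an assumption), it could be computed for a single user as sketched below. The function name is illustrative and not Albedo's evaluator.

```python
import math


def ndcg_at_k(recommended, relevant, k):
    """NDCG@k with binary relevance for one user.

    `recommended` is an ordered list of item ids, best first.
    `relevant` is the collection of items the user actually starred.
    """
    relevant = set(relevant)
    # Discounted gain: a hit at 0-based position `rank` is worth 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    # Ideal ordering puts every relevant item first.
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


print(ndcg_at_k(["a", "b", "c"], ["a", "b"], k=3))
print(ndcg_at_k(["x", "a"], ["a"], k=2))
```

Averaging this value over all evaluated users gives a single number between 0 and 1, which makes the different recommenders directly comparable.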

TODO

  • Build a recommender system with Spark: Factorization Machine
  • Build a recommender system with Spark: GBDT for Feature Learning
  • Build a recommender system with Spark: Item2Vec
  • Build a recommender system with Spark: PageRank and GraphX
  • Build a recommender system with Spark: XGBoost

Related Posts
