
R package for automation of machine learning, forecasting, feature engineering, model evaluation, model interpretation, data generation, and recommenders.

Stars: ✭ 159 (+6.71%)

Mutual labels: recommender-system, feature-engineering

Real Time Stream Processing Engine

This is an example of real time stream processing using Spark Streaming, Kafka & Elasticsearch.

Stars: ✭ 37 (-75.17%)

Mutual labels: apache-spark, elasticsearch

NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

Stars: ✭ 797 (+434.9%)

Mutual labels: recommender-system, feature-engineering

Griffon Vm

Griffon Data Science Virtual Machine

Stars: ✭ 128 (-14.09%)

Mutual labels: apache-spark, elasticsearch

Oryx

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

Stars: ✭ 1,785 (+1097.99%)

Mutual labels: apache-spark

Eventflow

Async/await first CQRS+ES and DDD framework for .NET

Stars: ✭ 1,932 (+1196.64%)

Mutual labels: elasticsearch

Sns Forum Website

Nowcoder (牛客网) advanced project: an SNS and community Q&A website

Stars: ✭ 143 (-4.03%)

Mutual labels: elasticsearch

Scalable Data Science

Scalable Data Science, course sets in big data Using Apache Spark over databricks and their mathematical, statistical and computational foundations using SageMath.

Stars: ✭ 142 (-4.7%)

Mutual labels: apache-spark

Ncf

A pytorch implementation of He et al. "Neural Collaborative Filtering" at WWW'17

Stars: ✭ 149 (+0%)

Mutual labels: recommender-system

Evalml

EvalML is an AutoML library written in python.

Stars: ✭ 145 (-2.68%)

Mutual labels: feature-engineering

Sofang

A house-hunting website (Soufang) built with Spring Boot and Elasticsearch

Stars: ✭ 146 (-2.01%)

Mutual labels: elasticsearch

Json Logging Python

Python logging library to emit JSON log that can be easily indexed and searchable by logging infrastructure such as ELK, EFK, AWS Cloudwatch, GCP Stackdriver

Stars: ✭ 143 (-4.03%)

Mutual labels: elasticsearch

Parquetviewer

Simple windows desktop application for viewing & querying Apache Parquet files

Stars: ✭ 145 (-2.68%)

Mutual labels: apache-spark

Hydrograph

A visual ETL development and debugging tool for big data

Stars: ✭ 144 (-3.36%)

Mutual labels: apache-spark

Elasticgeo

ElasticGeo provides a GeoTools data store that allows geospatial features from an Elasticsearch index to be published via OGC services using GeoServer.

Stars: ✭ 148 (-0.67%)

Mutual labels: elasticsearch

Caiss

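A cross-platform, multi-language, high-performance retrieval engine for similar vectors, similar words, and similar sentences. Powerful and easy to use. Stars and forks are welcome. Build together! Power another!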

Stars: ✭ 142 (-4.7%)

Mutual labels: recommender-system

Rival

RiVal recommender system evaluation toolkit

Stars: ✭ 145 (-2.68%)

Mutual labels: recommender-system

Indigo

Universal cheminformatics libraries, utilities and database search tools

Stars: ✭ 146 (-2.01%)

Mutual labels: elasticsearch


Albedo

A recommender system for discovering GitHub repos, built with Apache Spark.

Albedo is a fictional character in Dan Simmons's Hyperion Cantos series. Councilor Albedo is the TechnoCore's AI advisor to the Hegemony of Man.

Setup

$ git clone https://github.com/vinta/albedo.git
$ cd albedo
$ make up

Collect Data

Create your own GitHub personal access token on your GitHub settings page; it is passed to the collect_data command as GITHUB_PERSONAL_TOKEN.

# get into the main container
$ make attach

# this step might take a few hours to complete,
# depending on how many repos you have starred and how many users you follow
$ (container) python manage.py migrate
$ (container) python manage.py collect_data -t GITHUB_PERSONAL_TOKEN -u GITHUB_USERNAME
# or import a prebuilt database dump instead
$ (container) wget https://s3-ap-northeast-1.amazonaws.com/files.albedo.one/albedo.sql
$ (container) mysql -h mysql -u root -p123 albedo < albedo.sql

# username: albedo
# password: hyperion
$ make run
$ open http://127.0.0.1:8000/admin/

Start a Spark Cluster

You could also create a Spark cluster on Google Cloud Dataproc.

# start a local Spark cluster in Standalone mode
$ make spark_start

Use Popularity as the Recommendation Baseline

See PopularityRecommenderBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.PopularityRecommenderBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.002017744675282716
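
As a rough illustration of what a popularity baseline does, here is a minimal Python sketch. It is not the code in PopularityRecommenderBuilder.scala; the function name, the (user, repo) input format, and the sample data are all assumptions made for this example. It ranks repos by star count and recommends the most-starred ones the user has not starred yet.

```python
from collections import Counter


def recommend_popular(starrings, user, k=3):
    """Return the k most-starred repos that `user` has not starred yet.

    `starrings` is a list of (user, repo) pairs (hypothetical format).
    """
    star_counts = Counter(repo for _, repo in starrings)
    already_starred = {repo for u, repo in starrings if u == user}
    # Sort by star count (descending), then by name so ties are deterministic.
    ranked = sorted(star_counts.items(), key=lambda item: (-item[1], item[0]))
    return [repo for repo, _ in ranked if repo not in already_starred][:k]


# Hypothetical sample data.
starrings = [
    ("alice", "spark"), ("bob", "spark"), ("carol", "spark"),
    ("alice", "kafka"), ("bob", "kafka"),
    ("carol", "flink"),
]
print(recommend_popular(starrings, "carol", k=2))
```

Every user gets essentially the same list, which is why this kind of recommender is useful mainly as a baseline for the personalized models in the later steps.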

Build the User Profile for Feature Engineering

See UserProfileBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.UserProfileBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar

Build the Item Profile for Feature Engineering

See RepoProfileBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.RepoProfileBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar

Train an ALS Model for Candidate Generation

See ALSRecommenderBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.ALSRecommenderBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.05209047292612741
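
ALS factorizes the user-repo interaction matrix into one latent vector per user and one per repo. The sketch below does not train anything and is not taken from ALSRecommenderBuilder.scala; it only illustrates, with made-up two-dimensional vectors and a hypothetical function name, how candidates can be generated once such vectors exist: score each repo by its dot product with the user vector and keep the top k.

```python
def top_k_candidates(user_vector, repo_vectors, k=2, exclude=()):
    """Rank repos by dot product with the user's latent vector.

    `repo_vectors` maps repo name -> latent vector (hypothetical data).
    `exclude` holds repos to skip, e.g. ones the user already starred.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = {
        repo: dot(user_vector, vector)
        for repo, vector in repo_vectors.items()
        if repo not in exclude
    }
    # Highest score first; break ties by name for a stable order.
    return sorted(scores, key=lambda repo: (-scores[repo], repo))[:k]


# Made-up latent factors, standing in for a trained model's output.
repo_vectors = {
    "spark": [0.9, 0.1],
    "django": [0.1, 0.8],
    "flink": [0.7, 0.2],
}
user_vector = [1.0, 0.0]
print(top_k_candidates(user_vector, repo_vectors, k=2, exclude={"spark"}))
```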

Build a Content-based Recommender for Candidate Generation

Elasticsearch's More Like This API will do the trick.

$ (container) python manage.py sync_data_to_es

See ContentRecommenderBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,org.apache.httpcomponents:httpclient:4.5.2,org.elasticsearch.client:elasticsearch-rest-high-level-client:5.6.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.ContentRecommenderBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.002559563451967487
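
For orientation, the following Python sketch builds the JSON body of an Elasticsearch `more_like_this` query that asks for repos similar to ones a user already starred. It is not taken from ContentRecommenderBuilder.scala: the index name, document type, field names, and repo IDs are assumptions for illustration, and the `_type` key applies to the Elasticsearch 5.x line used in this project (later versions removed mapping types).

```python
import json


def build_more_like_this_query(index, doc_type, repo_ids, fields, size=10):
    """Build a search body that finds documents similar to the given repos."""
    # Each "like" entry references a document already stored in the index.
    like_docs = [
        {"_index": index, "_type": doc_type, "_id": str(repo_id)}
        for repo_id in repo_ids
    ]
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": like_docs,
                "min_term_freq": 1,
                "max_query_terms": 25,
            }
        },
    }


# Hypothetical index, type, fields, and IDs.
query = build_more_like_this_query(
    index="repo",
    doc_type="repo_info",
    repo_ids=[101, 202],
    fields=["description", "full_name"],
    size=5,
)
print(json.dumps(query, indent=2))
```

The resulting body would be sent to the index's `_search` endpoint; the hits serve as content-based candidates.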

Train a Word2Vec Model for Text Vectorization

See Word2VecCorpusBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,com.hankcs:hanlp:portable-1.3.4,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.Word2VecCorpusBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar

Train a Logistic Regression Model for Ranking

See LogisticRegressionRanker.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,com.hankcs:hanlp:portable-1.3.4,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.LogisticRegressionRanker \
    target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.021114356461615493
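
In a two-stage recommender, the ranking step scores the candidates produced earlier and orders them. The Python sketch below shows the scoring side of a logistic regression ranker only; it is not the code in LogisticRegressionRanker.scala, and the weights, feature values, and function names are invented for illustration. Each candidate's features are combined linearly, passed through the sigmoid, and the candidates are sorted by the resulting probability.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def rank_candidates(candidates, weights, bias=0.0):
    """Sort candidate repos by predicted probability, highest first.

    `candidates` maps repo name -> feature vector (hypothetical features).
    """
    scored = []
    for repo, features in candidates.items():
        z = bias + sum(w * x for w, x in zip(weights, features))
        scored.append((repo, sigmoid(z)))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Invented weights and features, standing in for a trained model.
weights = [2.0, -1.0]
candidates = {
    "spark": [1.0, 0.2],
    "django": [0.0, 0.9],
    "flink": [0.5, 0.5],
}
for repo, probability in rank_candidates(candidates, weights):
    print(repo, round(probability, 3))
```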

TODO

Build a recommender system with Spark: Factorization Machine
Build a recommender system with Spark: GBDT for Feature Learning
Build a recommender system with Spark: Item2Vec
Build a recommender system with Spark: PageRank and GraphX
Build a recommender system with Spark: XGBoost


Stars: ✭ 149


vinta / Albedo
