RUCAIBox / RecSysDatasets

Licence: other
This is a repository of public data sources for Recommender Systems (RS).

Datasets For Recommender Systems

All of these recommendation datasets can be converted to the atomic files defined in RecBole, a unified, comprehensive and efficient recommendation library.

After converting to the atomic files, you can use RecBole to test the performance of different recommender models on these datasets easily. For more information about RecBole, please refer to RecBole.

Usage

To use RecBole, you need to convert the original datasets into atomic files, a data format defined by RecBole.

We provide two ways to convert these datasets into atomic files:

  1. Download the raw dataset and process it with the conversion tools provided in this repository. Please refer to conversion tools.

  2. Download the processed atomic files directly from Baidu Wangpan (password: e272) or Google Drive.
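
As a rough illustration of what a conversion produces, the sketch below writes a small interaction file by hand. It assumes the RecBole convention of a tab-separated file whose header names each column as `name:type`; the helper `write_inter_file` and the sample rows are hypothetical and are not part of the conversion tools in this repository.

```python
import csv
import io


def write_inter_file(rows, out, fields):
    """Write interaction rows as a tab-separated atomic file.

    fields is a list of (name, type) pairs, e.g. ("user_id", "token").
    The header line joins each pair as "name:type" (assumed convention).
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(f"{name}:{ftype}" for name, ftype in fields)
    writer.writerows(rows)


# Hypothetical column layout and sample data, for illustration only.
fields = [
    ("user_id", "token"),
    ("item_id", "token"),
    ("rating", "float"),
    ("timestamp", "float"),
]
rows = [
    ("u1", "i10", 4.0, 978300760),
    ("u2", "i10", 3.5, 978302109),
]

buf = io.StringIO()
write_inter_file(rows, buf, fields)
atomic_text = buf.getvalue()
```

For real datasets, the conversion tools in this repository remain the reference; the sketch only shows the general shape of the output.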

Dataset links and brief introductions

Shopping

  • Amazon: This dataset contains product reviews, ratings-only data and metadata (descriptions, category information, price, brand, and image features) from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.
  • Epinions: This dataset was collected from Epinions.com, a popular online consumer review website. It contains trust relationships amongst users and spans more than a decade, from January 2001 to November 2013.
  • Yelp: This dataset was collected from Yelp.com. It is a subset of Yelp's businesses, reviews, and user data for use in personal, educational, and academic purposes.
  • Tmall: This dataset is provided by Ant Financial Services and was used in the IJCAI16 contest.
  • DIGINETICA: The dataset includes user sessions extracted from e-commerce search engine logs, with anonymized user ids, hashed queries, hashed query terms, hashed product descriptions and meta-data, log-scaled prices, clicks, and purchases.
  • YOOCHOOSE: This dataset was constructed by YOOCHOOSE GmbH to support participants in the RecSys Challenge 2015.
  • Retailrocket: The data was collected from a real-world e-commerce website. It is raw data, i.e. without any content transformations; however, all values are hashed for confidentiality reasons.
  • Ta Feng: The dataset contains transaction data from a Chinese grocery store from November 2000 to February 2001.

Advertising

  • Criteo: This dataset was collected from Criteo and consists of a portion of Criteo's traffic over a period of several days.
  • Avazu: This dataset was used in the Avazu CTR prediction contest.
  • iPinYou: This dataset was provided by iPinYou and contains all training datasets and leaderboard testing datasets from the three seasons of the iPinYou Global RTB (Real-Time Bidding) Bidding Algorithm Competition.

Check-in

  • Foursquare: This dataset contains check-ins in NYC and Tokyo collected over about 10 months. Each check-in is associated with its time stamp, its GPS coordinates and its semantic meaning.
  • Gowalla: This dataset is from a location-based social networking website where users share their locations by checking in, and contains a total of 6,442,890 check-ins by these users over the period of Feb. 2009 - Oct. 2010.

Movies

  • MovieLens: GroupLens Research has collected and made available rating datasets from their movie web site.
  • Netflix: This is the official data set used in the Netflix Prize competition.
  • Douban: Douban Movie is a Chinese website that allows Internet users to share their comments and viewpoints about movies. This dataset contains more than 2 million short comments on 28 movies from the Douban Movie website.

Music

  • Last.FM: This dataset contains social networking, tagging, and music artist listening information from a set of 2K users of the Last.fm online music system.
  • LFM-1b: This dataset contains more than one billion music listening events created by more than 120,000 users of Last.FM. Each listening event is characterized by artist, album, and track name, and includes a timestamp.
  • Yahoo Music: This dataset represents a snapshot of the Yahoo! Music community's preferences for various musical artists.

Books

  • Book-Crossing: This dataset was collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. It contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.

Games

  • Steam: This dataset contains reviews and game information from Steam, covering 7,793,069 reviews, 2,567,538 users, and 32,135 games. In addition to the review text, the data also includes the users' play hours in each review.

Anime

  • Anime: This dataset contains user preference data from myanimelist.net. Each user is able to add anime to their completed list and give it a rating, and this dataset is a compilation of those ratings.

Pictures

  • Pinterest: This dataset was originally constructed in the paper "Learning image and user features for recommendations in social networks" for evaluating content-based image recommendation, and was later processed in the paper "Neural Collaborative Filtering".

Jokes

  • Jester: This dataset contains anonymous ratings of jokes by users of the Jester Joke Recommender System.

Exercises

  • KDD2010: This dataset was released in the KDD Cup 2010 Educational Data Mining Challenge and contains records of students submitting exercises on the systems.

Websites

  • Phishing Websites: This dataset contains 30 kinds of features for 11,055 websites, with labels indicating whether each is a phishing website. The features include 12 address-bar based features, 6 abnormal based features, 5 HTML-and-JavaScript based features and 7 domain based features.

Adult

  • Adult: This dataset was extracted by Barry Becker from the 1994 Census database. It consists of a list of people's attributes and whether they make over 50k a year.

News

  • MIND: This is a large-scale dataset for news recommendation research. It was collected from anonymized behavior logs of the Microsoft News website. MIND contains about 160k English news articles and more than 15 million impression logs generated by 1 million users.

Dataset statistics

General Datasets

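| SN | Dataset | #User | #Item | #Interaction | Sparsity | Interaction Type | TimeStamp | User Context | Item Context | Interaction Context |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | MovieLens | - | - | - | - | Rating | | | | |
| 2 | Anime | 73,515 | 11,200 | 7,813,737 | 99.05% | Rating [-1, 1-10] | | | | |
| 3 | Epinions | 116,260 | 41,269 | 188,478 | 99.99% | Rating [1-5] | | | | |
| 4 | Yelp | 1,968,703 | 209,393 | 8,021,122 | 99.99% | Rating [1-5] | | | | |
| 5 | Netflix | 480,189 | 17,770 | 100,480,507 | 98.82% | Rating [1-5] | | | | |
| 6 | Book-Crossing | 105,284 | 340,557 | 1,149,780 | 99.99% | Rating [0-10] | | | | |
| 7 | Jester | 73,421 | 101 | 4,136,360 | 44.22% | Rating [-10, 10] | | | | |
| 8 | Douban | 738,701 | 28 | 2,125,056 | 89.73% | Rating [0,5] | | | | |
| 9 | Yahoo Music | 1,948,882 | 98,211 | 11,557,943 | 99.99% | Rating [0, 100] | | | | |
| 10 | KDD2010 | - | - | - | - | Rating | | | | |
| 11 | Amazon | - | - | - | - | Rating | | | | |
| 12 | Pinterest | 55,187 | 9,911 | 1,445,622 | 99.74% | - | | | | |
| 13 | Gowalla | 107,092 | 1,280,969 | 6,442,892 | 99.99% | Check-in | | | | |
| 14 | Last.FM | 1,892 | 17,632 | 92,834 | 99.72% | Click | | | | |
| 15 | DIGINETICA | 204,789 | 184,047 | 993,483 | 99.99% | Click | | | | |
| 16 | Steam | 2,567,538 | 32,135 | 7,793,069 | 99.99% | Buy | | | | |
| 17 | Ta Feng | 32,266 | 23,812 | 817,741 | 99.89% | Click | | | | |
| 18 | Foursquare | - | - | - | - | Check-in | | | | |
| 19 | Tmall | 963,923 | 2,353,207 | 44,528,127 | 99.99% | Click/Buy | | | | |
| 20 | YOOCHOOSE | 9,249,729 | 52,739 | 34,154,697 | 99.99% | Click/Buy | | | | |
| 21 | Retailrocket | 1,407,580 | 247,085 | 2,756,101 | 99.99% | View/Addtocart/Transaction | | | | |
| 22 | LFM-1b | 120,322 | 3,123,496 | 1,088,161,692 | 99.71% | Click | | | | |
| 23 | MIND | - | - | - | - | Click | | | | |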
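
The Sparsity figures above appear consistent with the usual definition, 1 - #Interaction / (#User × #Item), expressed as a percentage. The sketch below recomputes a few rows under that assumption; the function name `sparsity_percent` is hypothetical and the repository does not state how the column was actually derived.

```python
def sparsity_percent(n_users, n_items, n_interactions):
    """Return the share of empty user-item cells, as a percentage."""
    density = n_interactions / (n_users * n_items)
    return 100.0 * (1.0 - density)


# Values copied from the General Datasets table.
jester = sparsity_percent(73_421, 101, 4_136_360)
netflix = sparsity_percent(480_189, 17_770, 100_480_507)
douban = sparsity_percent(738_701, 28, 2_125_056)

print(f"Jester:  {jester:.2f}%")   # 44.22%
print(f"Netflix: {netflix:.2f}%")  # 98.82%
print(f"Douban:  {douban:.2f}%")   # 89.73%
```

Datasets with few items and many ratings per user, such as Jester and Douban, come out far denser than the others.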

CTR Datasets

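| SN | Dataset | #User | #Item | #Interaction | Sparsity | Interaction Type | TimeStamp | User Context | Item Context | Interaction Context |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Criteo | - | - | 45,850,617 | - | Click | | | | |
| 2 | Avazu | - | - | 40,428,967 | - | Click [0, 1] | | | | |
| 3 | iPinYou | 19,731,660 | 163 | 24,637,657 | 99.23% | View/Click | | | | |
| 4 | Phishing websites | - | - | 11,055 | - | | | | | |
| 5 | Adult | - | - | 32,561 | - | income>=50k [0, 1] | | | | |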

Knowledge-aware Datasets

These knowledge-aware recommender datasets are based on KB4Rec, which associates items from recommender systems with entities from Freebase.

Raw datasets information

| SN | Dataset | #Items | #Linked-Items | #Users | #Interactions |
|---|---|---|---|---|---|
| 1 | MovieLens | 27,278 | 25,503 | 138,493 | 20,000,263 |
| 2 | Amazon-book | 2,370,605 | 108,515 | 8,026,324 | 22,507,155 |
| 3 | LFM-1b (tracks) | 31,634,450 | 1,254,923 | 120,322 | 319,951,294 |

After 5-core filtering (for LFM-1b, tracks listened to fewer than 10 times are also filtered out)

| SN | Dataset | #Items | #Linked-Items | #Users | #Interactions |
|---|---|---|---|---|---|
| 1 | MovieLens | 18,345 | 18,057 | 138,493 | 19,984,024 |
| 2 | Amazon-book | 367,982 | 34,476 | 603,668 | 8,898,041 |
| 3 | LFM-1b (tracks) | 615,823 | 337,349 | 79,133 | 15,765,756 |
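
For readers unfamiliar with k-core filtering, the sketch below shows one common interpretation: repeatedly drop users and items with fewer than k interactions until nothing more is removed. Whether the statistics above were produced iteratively or in a single pass is not stated here, so treat the function `k_core_filter` and its toy data as hypothetical.

```python
from collections import Counter


def k_core_filter(interactions, k=5):
    """Keep only (user, item) pairs whose user and item both have >= k interactions.

    The filter is repeated until stable, because removing one pair can push
    another user or item below the threshold.
    """
    current = list(interactions)
    while True:
        user_counts = Counter(user for user, _ in current)
        item_counts = Counter(item for _, item in current)
        kept = [
            (user, item)
            for user, item in current
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        if len(kept) == len(current):
            return kept
        current = kept


# Toy example with k=2: item "z" is dropped first, which then leaves user "c"
# with a single interaction, so "c" is dropped in the next pass.
toy = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"), ("c", "x"), ("c", "z")]
filtered = k_core_filter(toy, k=2)
```

The toy run illustrates why a single pass can leave users or items that no longer meet the threshold.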