titipata / Yelp_dataset_challenge

Play around with Yelp dataset in Python (in progress and very messy repo)

Yelp Dataset Challenge for Python

Repository for downloading and reading the Yelp Dataset Challenge (round 6) data in Pandas pickle format. It makes it easy for anyone who wants to experiment with Yelp data in Python. The yelp_util Python package provides functions for downloading and reading the data.

Datasets repository

The S3 bucket is structured as follows:

science-of-science-bucket
└─yelp_academic_dataset
  ├───yelp_academic_dataset_business.pickle (61k rows)
  ├───yelp_academic_dataset_review.pickle (1.5M rows)
  ├───yelp_academic_dataset_user.pickle (366k rows)
  ├───yelp_academic_dataset_checkin.pickle (45k rows)
  └───yelp_academic_dataset_tip.pickle (495k rows)

You can download the data directly from the AWS S3 bucket as follows:

import yelp_util
yelp_util.download(file_list=["yelp_academic_dataset_business.pickle",
                              "yelp_academic_dataset_review.pickle",
                              "yelp_academic_dataset_user.pickle",
                              "yelp_academic_dataset_checkin.pickle",
                              "yelp_academic_dataset_tip.pickle"])

The files are downloaded to the data folder. Once the download finishes, you can read a pickle file as follows:

import pandas as pd
review = pd.read_pickle('data/yelp_academic_dataset_review.pickle')
review.head()

Structure of Datasets

User: table of user information (366k rows)

average_stars, compliments, elite, fans, friends, name, review_count, type, user_id, votes, yelping_since

Business: table of businesses with their location and city (61k rows)

attributes, business_id, categories, city, full_address, hours, latitude, longitude, name, neighborhoods, open, review_count, stars, state, type

Review: table of reviews written by users (1.5M rows)

business_id, date, review_id, stars, text, type, user_id, votes_cool, votes_funny, votes_useful

Checkin: table of check-ins (45k rows)

business_id, checkin_info, type

Tip: table of tips (495k rows)

business_id, date, likes, text, type, user_id
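The tables share the business_id and user_id keys, so they can be joined with pandas. The following is a minimal sketch using tiny made-up DataFrames (the rows and values are illustrative, not taken from the dataset) that attaches business information to each review and averages the review stars per city.

```python
import pandas as pd

# Toy stand-ins for the real tables; the values are invented for illustration.
business = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "name": ["Cafe A", "Bar B"],
    "city": ["Phoenix", "Las Vegas"],
    "stars": [4.0, 4.5],  # the business-level rating also exists in the real table
})
review = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "business_id": ["b1", "b1", "b2"],
    "user_id": ["u1", "u2", "u1"],
    "stars": [5, 3, 4],
})

# Keep only the business columns we need, so the business-level "stars"
# column does not collide with the review-level "stars" column.
business_info = business[["business_id", "name", "city"]]

# Left join: every review keeps its row and gains the business name and city.
review_with_business = review.merge(business_info, on="business_id", how="left")

# Average review rating per city.
mean_stars_by_city = review_with_business.groupby("city")["stars"].mean()
print(mean_stars_by_city)
```

The same pattern should apply to the user table by merging on user_id instead.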

Cluster businesses according to how they are tagged

Read the business data:

import pandas as pd
import yelp_util

business = pd.read_pickle('data/yelp_academic_dataset_business.pickle')
tags = business.categories.tolist()

Then transform the tags into a count matrix:

tag_countmatrix = yelp_util.taglist_to_matrix(tags)
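To give an idea of what such a count matrix looks like, here is a rough sketch built with scikit-learn's MultiLabelBinarizer. This is not the implementation of yelp_util.taglist_to_matrix, and it assumes that each entry of tags is a list of category strings; the sample tags are made up.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up tag lists, assuming one list of category strings per business.
sample_tags = [
    ["Food", "Bars"],
    ["Food"],
    ["Bars", "Nightlife"],
]

# Each row is a business, each column is a tag; a cell is 1 when the
# business carries that tag and 0 otherwise.
binarizer = MultiLabelBinarizer()
sample_matrix = binarizer.fit_transform(sample_tags)

print(list(binarizer.classes_))  # column order: tags sorted alphabetically
print(sample_matrix)
```

Because a business normally lists each tag at most once, the counts here are all 0 or 1.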

The count matrix can then be used to cluster the businesses:

from sklearn.cluster import KMeans
km = KMeans(n_clusters=3)
km.fit(tag_countmatrix)
business['cluster'] = km.predict(tag_countmatrix)
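One way to interpret the resulting clusters is to look at which tags carry the most weight in each cluster center. The sketch below illustrates this on a small made-up example; the tag lists, the top_tags helper, and the use of MultiLabelBinarizer in place of yelp_util.taglist_to_matrix are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up tag lists forming two obvious groups of businesses.
sample_tags = [
    ["Restaurants", "Pizza"],
    ["Restaurants", "Pizza"],
    ["Beauty & Spas", "Nail Salons"],
    ["Beauty & Spas", "Nail Salons"],
]

binarizer = MultiLabelBinarizer()
sample_matrix = binarizer.fit_transform(sample_tags)

# Fixing random_state makes the run repeatable.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sample_matrix)
labels = km.labels_


def top_tags(center, tag_names, k=2):
    """Hypothetical helper: return the k tags with the largest center weights."""
    order = np.argsort(center)[::-1][:k]
    return [tag_names[i] for i in order]


# Map each cluster id to its most characteristic tags.
cluster_tags = {
    cluster_id: top_tags(km.cluster_centers_[cluster_id], binarizer.classes_)
    for cluster_id in range(km.n_clusters)
}
print(cluster_tags)
```

On the real data, the number of clusters is a modeling choice; n_clusters=3 above is only a starting point.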

Train word2vec model

import pandas as pd
import yelp_util

review = pd.read_pickle('data/yelp_academic_dataset_review.pickle')
yelp_review_sample = review.text.iloc[10000:20000].tolist()  # sample of 10,000 review texts
model = yelp_util.create_word2vec_model(yelp_review_sample)  # train a word2vec model on the sample
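Word2vec trainers typically work on tokenized text, for example gensim's Word2Vec accepts an iterable of token lists. How yelp_util.create_word2vec_model preprocesses the reviews is not shown here, so the following is only a sketch of one simple tokenization step, with a hypothetical tokenize_reviews helper and made-up review texts.

```python
import re


def tokenize_reviews(texts):
    """Hypothetical helper: lowercase each review and split it into word tokens."""
    # Keep runs of letters, digits, and apostrophes; drop other punctuation.
    token_pattern = re.compile(r"[a-z0-9']+")
    return [token_pattern.findall(text.lower()) for text in texts]


# Made-up review texts for illustration.
sample_reviews = ["Great tacos, friendly staff!", "Didn't love the wait."]
sentences = tokenize_reviews(sample_reviews)
print(sentences)
```

Each inner list is one review, which is the shape a word2vec trainer would usually consume.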

Django runserver

The Django project lives in the random_reviews folder. To get started, run python manage.py migrate. To serve the project on your local machine (mainly useful for customizing the CSS files), run python manage.py runserver.

Dependencies

Members
