
ayush1997 / Youtube Like Predictor

YouTube Like Count Predictions using Machine Learning

Projects that are alternatives of or similar to Youtube Like Predictor

Pandas Videos
Jupyter notebook and datasets from the pandas Q&A video series
Stars: ✭ 1,716 (+1152.55%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
Bayesian Cognitive Modeling In Pymc3
PyMC3 codes of Lee and Wagenmakers' Bayesian Cognitive Modeling - A Pratical Course
Stars: ✭ 93 (-32.12%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
Datacamp
🍧 A repository that contains courses I have taken on DataCamp
Stars: ✭ 69 (-49.64%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
Optimus
🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Stars: ✭ 986 (+619.71%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
Loandefault Prediction
Lending Club Loan data analysis
Stars: ✭ 113 (-17.52%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
Data Science Lunch And Learn
Resources for weekly Data Science Lunch & Learns
Stars: ✭ 49 (-64.23%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
Hyperlearn
50% faster, 50% less RAM Machine Learning. Numba rewritten Sklearn. SVD, NNMF, PCA, LinearReg, RidgeReg, Randomized, Truncated SVD/PCA, CSR Matrices all 50+% faster
Stars: ✭ 1,204 (+778.83%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
Resources
PyMC3 educational resources
Stars: ✭ 930 (+578.83%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
Spark R Notebooks
R on Apache Spark (SparkR) tutorials for Big Data analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 109 (-20.44%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
Ml Da Coursera Yandex Mipt
Machine Learning and Data Analysis Coursera Specialization from Yandex and MIPT
Stars: ✭ 108 (-21.17%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
Datasist
A Python library for easy data analysis, visualization, exploration and modeling
Stars: ✭ 123 (-10.22%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
Seaborn Tutorial
This repository is my attempt to help Data Science aspirants gain necessary Data Visualization skills required to progress in their career. It includes all the types of plot offered by Seaborn, applied on random datasets.
Stars: ✭ 114 (-16.79%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
Pandas Profiling
Create HTML profiling reports from pandas DataFrame objects
Stars: ✭ 8,329 (+5979.56%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
25daysinmachinelearning
I will update this repository to learn Machine learning with python with statistics content and materials
Stars: ✭ 53 (-61.31%)
Mutual labels:  jupyter-notebook, data-science, random-forest
Data Science On Gcp
Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Stars: ✭ 864 (+530.66%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
My Journey In The Data Science World
📢 Ready to learn or review your knowledge!
Stars: ✭ 1,175 (+757.66%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
Skdata
Python tools for data analysis
Stars: ✭ 16 (-88.32%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
Spring2017 proffosterprovost
Introduction to Data Science
Stars: ✭ 18 (-86.86%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
Spark Py Notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (+876.64%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
Pythondata
repo for code published on pythondata.com
Stars: ✭ 113 (-17.52%)
Mutual labels:  jupyter-notebook, data-science, data-analysis

YouTube Like Count Predictor

This is a tool for predicting the like count of YouTube videos. A Random Forest model was trained on a large dataset of ~350,000 videos. Feature engineering, data cleaning, data selection, and other techniques were used for this task.

Report

Report.pdf contains a detailed explanation of different steps and techniques that were used for this task.

Tools Used

How to run:

  1. Clone this repo

    $ git clone https://github.com/ayush1997/YouTube-Like-predictor.git
    $ cd YouTube-Like-predictor
    
  2. Create new virtual environment

    $ sudo pip install virtualenv
    $ virtualenv venv
    $ source venv/bin/activate
    $ pip install -r requirements.txt
    
  3. Predictions

    There are two ways for getting the prediction results.

    3.1. Train the model and run the prediction

    $ cd model
    $ python train_model.py
    

    This saves a model-final file in the same folder. Training takes ~18 minutes. Then run:

    $ python predict.py <list of video ids>
    

    For example: $ python predict.py dOyJqGtP-wU ASO_zypdnsQ wEduiMyl0ko

    3.2. From a pretrained model

    A pretrained model has been uploaded to Dropbox. Download the model (~500 MB) from the link.

    Unzip the model-final file in the model folder.

    $ cd model
    $ python predict.py <list of video ids>
    

    For example: $ python predict.py vid1 vid2 vid3

Note: The list can contain a maximum of 40 video IDs per run.
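The 40-ID limit can be checked before any prediction work starts. The following is a minimal sketch of such a check, not the code in predict.py; `MAX_IDS` and `parse_video_ids` are hypothetical names introduced here for illustration.

```python
MAX_IDS = 40  # limit stated in the note above


def parse_video_ids(args):
    """Return the video IDs from the command-line arguments.

    Duplicates are dropped while the original order is kept.
    Hypothetical helper; not part of the repository.
    """
    ids = list(dict.fromkeys(arg.strip() for arg in args if arg.strip()))
    if not ids:
        raise ValueError("expected at least one video ID")
    if len(ids) > MAX_IDS:
        raise ValueError(f"got {len(ids)} video IDs, maximum is {MAX_IDS}")
    return ids


# Same IDs as the example command in section 3.1.
print(parse_video_ids(["dOyJqGtP-wU", "ASO_zypdnsQ", "wEduiMyl0ko"]))
```

In a real script the argument list would come from `sys.argv[1:]`.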

Code Details

Below is a brief description of the code files and folders in the repo.

data/

This folder contains the scripts that were used to fetch data through the YouTube API and populate the database.

$ cd data

get_IDS.py

The script uses the YouTube Search API to extract video IDs for the last 7 years (2010-2016). It yields approximately 22,000-24,000 video IDs per category and stores them in separate pickle files, one per category.

$ python get_IDS.py
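The per-category pickle storage described above could look roughly like this. It is a sketch under assumptions: the `<category>.pkl` file naming, the `save_ids` and `load_ids` helpers, and the `music` category are hypothetical and are not taken from get_IDS.py.

```python
import pickle
import tempfile
from pathlib import Path


def save_ids(out_dir, category, video_ids):
    """Write one category's video IDs to <out_dir>/<category>.pkl (assumed layout)."""
    path = Path(out_dir) / f"{category}.pkl"
    with path.open("wb") as fh:
        # Sort and de-duplicate so repeated search results are stored once.
        pickle.dump(sorted(set(video_ids)), fh)
    return path


def load_ids(path):
    """Read a list of video IDs back from a pickle file."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Round trip in a temporary directory, so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_ids(tmp, "music", ["dOyJqGtP-wU", "ASO_zypdnsQ", "dOyJqGtP-wU"])
    restored = load_ids(saved)

print(restored)
```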

scrape_video.py

The script uses the video IDs saved by get_IDS.py to extract video-related attributes through the YouTube API and saves the resulting data dictionary in pickle format.

$ python scrape_video.py

scrape_channel.py

The script collects data for all channels present in the video dataset. It uses the stored video data to extract the channelIds.

$ python scrape_channel.py
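Extracting the channelIds from the stored video data amounts to collecting the distinct values of one field. The sketch below assumes the video data is a dictionary keyed by video ID whose values carry a `channelId` entry; that layout, the `unique_channel_ids` helper, and the placeholder IDs are assumptions, not the format used by scrape_video.py.

```python
def unique_channel_ids(videos):
    """Return the distinct channel IDs in first-seen order.

    `videos` is assumed to map video ID -> attribute dict.
    """
    seen = {}
    for attrs in videos.values():
        channel_id = attrs.get("channelId")
        if channel_id:  # skip videos whose channel is missing
            seen.setdefault(channel_id, None)
    return list(seen)


# Placeholder data for illustration only.
videos = {
    "vid1": {"channelId": "channel_a", "title": "first"},
    "vid2": {"channelId": "channel_b", "title": "second"},
    "vid3": {"channelId": "channel_a", "title": "third"},
    "vid4": {"title": "no channel recorded"},
}

print(unique_channel_ids(videos))
```

De-duplicating first means each channel is requested once, however many of its videos are in the dataset.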

scrape_social.py

The script scrapes social links.

$ python scrape_social.py

Note: Because a large amount of data had to be extracted for different attributes, the extraction was done in stages. A single data-collection script was not viable and would have made debugging messy.

notebook/

This folder contains IPython notebooks that implement merging of the extracted data, as well as tasks such as data cleaning and processing.

$ jupyter notebook

FeatureEngineering.ipynb

The notebook has the implementation for making new derived features.

DataProcessing.ipynb

This notebook implements the data cleaning and encoding steps.

Note: The final data generated after all processing has been uploaded to dataset/data.csv. dataset/data_final.csv holds the data used for training the model.

model/

This folder contains the scripts used for training and tuning the model and for getting the prediction results.

model_grid.py

This script finds tuned parameters for the estimator using grid search and cross-validation.

$ python model_grid.py

train_model.py

This script trains the model on the training data (dataset/data_final.csv). Because of bootstrap sampling in the random forest, the results might vary after every training run.

$ python train_model.py
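The run-to-run variation comes from the random row selection that bootstrap sampling performs for each tree. The sketch below shows this with the standard library only; `bootstrap_sample` is a hypothetical helper, not code from train_model.py. It also shows that a fixed seed makes the sample repeatable. If the model is built with scikit-learn (an assumption), passing `random_state` to the forest has the same effect.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as done for each tree in a forest."""
    return [rows[rng.randrange(len(rows))] for _ in rows]


rows = list(range(10))

# Unseeded generators: the two samples will usually differ between runs.
unseeded_a = bootstrap_sample(rows, random.Random())
unseeded_b = bootstrap_sample(rows, random.Random())

# Same seed: the two samples are identical, so training becomes repeatable.
seeded_a = bootstrap_sample(rows, random.Random(42))
seeded_b = bootstrap_sample(rows, random.Random(42))

print("unseeded samples equal:", unseeded_a == unseeded_b)
print("seeded samples equal:  ", seeded_a == seeded_b)
```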

predict.py

This script returns the like count prediction along with the difference from the actual count and the error rate.

$ cd model
$ python predict.py <list of video ids>

For example: $ python predict.py vid1 vid2 vid3
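The README does not define the difference or the error rate. One plausible reading, used in the sketch below as an assumption, is difference = predicted - actual and error rate = |difference| / actual as a percentage. `compare_prediction` is a hypothetical helper and not the function used in predict.py.

```python
def compare_prediction(predicted, actual):
    """Return (difference, error_rate_percent) for one video.

    The error rate is None when the actual like count is 0,
    because a relative error is undefined in that case.
    """
    difference = predicted - actual
    if actual == 0:
        return difference, None
    error_rate = abs(difference) * 100 / actual
    return difference, error_rate


# Illustrative numbers only, not model output.
difference, error_rate = compare_prediction(predicted=1100, actual=1000)
print(f"difference: {difference}, error rate: {error_rate:.1f}%")
```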

Issues

A very common issue arises in the pickling process, which sometimes leads to loss of information and different results on every run.

