
Doodies / Github-Stars-Predictor

License: MIT
A GitHub repo star predictor that tries to predict the star count of any GitHub repository with more than 100 stars.

Programming Languages

Jupyter Notebook
11667 projects
JavaScript
184084 projects - #8 most used programming language
Python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Github-Stars-Predictor

datascienv
datascienv is a package that sets up your environment in a single line of code with all dependencies; it also includes pyforest, which provides a single-line import of all required ML libraries
Stars: ✭ 53 (+55.88%)
Mutual labels:  seaborn, xgboost, matplotlib, catboost
Awesome Decision Tree Papers
A collection of research papers on decision, classification and regression trees with implementations.
Stars: ✭ 1,908 (+5511.76%)
Mutual labels:  random-forest, xgboost, catboost
aws-machine-learning-university-dte
Machine Learning University: Decision Trees and Ensemble Methods
Stars: ✭ 119 (+250%)
Mutual labels:  random-forest, xgboost, catboost
Track-Stargazers
Have fun tracking your project's stargazers
Stars: ✭ 38 (+11.76%)
Mutual labels:  github-api, github-stars
Tensorflow Ml Nlp
Natural language processing with TensorFlow and machine learning (from logistic regression to a Transformer chatbot)
Stars: ✭ 176 (+417.65%)
Mutual labels:  random-forest, xgboost
stellar
Search your github stars in R
Stars: ✭ 24 (-29.41%)
Mutual labels:  github-api, github-stars
decision-trees-for-ml
Building Decision Trees From Scratch In Python
Stars: ✭ 61 (+79.41%)
Mutual labels:  random-forest, xgboost
Awesome Github
A curated list of awesome GitHub guides, articles, sites, tools, projects and resources. This list was collected to help people make better use of GitHub; PRs and issues are welcome.
Stars: ✭ 1,962 (+5670.59%)
Mutual labels:  github-api, github-stars
AIML-Projects
Projects I completed as a part of Great Learning's PGP - Artificial Intelligence and Machine Learning
Stars: ✭ 85 (+150%)
Mutual labels:  random-forest, catboost
STOCK-RETURN-PREDICTION-USING-KNN-SVM-GUASSIAN-PROCESS-ADABOOST-TREE-REGRESSION-AND-QDA
Forecast stock prices using a machine learning approach (a time series analysis). Employs predictive modeling to forecast stock returns, an approach used by hedge funds to select tradeable stocks.
Stars: ✭ 94 (+176.47%)
Mutual labels:  random-forest, prediction
cqr
Conformalized Quantile Regression
Stars: ✭ 152 (+347.06%)
Mutual labels:  random-forest, prediction
Data-Analytics-Projects
This repository contains projects related to data collection, assessment, cleaning, visualization, and analysis
Stars: ✭ 167 (+391.18%)
Mutual labels:  seaborn, matplotlib
Machine Learning With Python
Practice and tutorial-style notebooks covering wide variety of machine learning techniques
Stars: ✭ 2,197 (+6361.76%)
Mutual labels:  random-forest, matplotlib
Benchm Ml
A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
Stars: ✭ 1,835 (+5297.06%)
Mutual labels:  random-forest, xgboost
github-interact-cli
🎩 Interact with GitHub right inside your terminal
Stars: ✭ 43 (+26.47%)
Mutual labels:  github-api, github-stars
Machine Learning In R
Workshop (6 hours): preprocessing, cross-validation, lasso, decision trees, random forest, xgboost, superlearner ensembles
Stars: ✭ 144 (+323.53%)
Mutual labels:  random-forest, xgboost
Python-Course
Python Basics, Machine Learning and Deep Learning
Stars: ✭ 50 (+47.06%)
Mutual labels:  seaborn, matplotlib
stackgbm
🌳 Stacked Gradient Boosting Machines
Stars: ✭ 24 (-29.41%)
Mutual labels:  xgboost, catboost
Tpot
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
Stars: ✭ 8,378 (+24541.18%)
Mutual labels:  random-forest, xgboost
Predicting real estate prices using scikit Learn
Predicting Amsterdam house / real estate prices using Ordinary Least Squares-, XGBoost-, KNN-, Lasso-, Ridge-, Polynomial-, Random Forest-, and Neural Network MLP Regression (via scikit-learn)
Stars: ✭ 78 (+129.41%)
Mutual labels:  random-forest, xgboost

Github Repo Stars Predictor

Overview

It's a GitHub repo star predictor that tries to predict the star count of any GitHub repository with more than 100 stars. It predicts based on the owner's or organization's status and activity (commits, forks, comments, branches, update rate, etc.) on the repository. Different types of models (gradient boosting, deep neural networks, etc.) have been tested successfully on the dataset we fetched from the GitHub APIs.

Dataset

We used the GitHub REST API and GraphQL API to collect data on repositories with more than 100 stars. The data is available in the dataset directory. We were able to collect the data faster by running multiple DigitalOcean servers, so we thank DigitalOcean for providing free server credits to students. For details on the dataset features, refer to the summary section below.
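As an illustration, repositories above a star threshold can be enumerated through the GitHub REST search endpoint. The helper below only constructs the request; pagination loops and authentication headers are omitted, and the function name is hypothetical (the real logic lives in the Node.js scripts described later):

```python
def search_repos_request(min_stars=100, page=1, per_page=100):
    """Build the GitHub REST search request for repos above a star threshold."""
    base = "https://api.github.com/search/repositories"
    params = {
        "q": f"stars:>{min_stars}",  # search qualifier: more than min_stars stars
        "sort": "stars",
        "order": "desc",
        "page": page,
        "per_page": per_page,        # GitHub caps this at 100 results per page
    }
    return base, params

# The actual HTTP call (with an auth token for higher rate limits) would be
# issued against `base` with these query parameters.
base, params = search_repos_request()
```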

Tools used

  • Python 2.7
  • Jupyter Notebook
  • NumPy
  • scikit-learn
  • Pandas
  • Keras
  • CatBoost
  • Matplotlib
  • seaborn

We also used Google Colab's GPU notebooks, so we thank Google for providing GPUs through the Colab project.

Code details

Below is a brief description of the code files and folders in the repo.

Bash Script

  • settingUpDOServer.sh
    filepath: scripts/bash/settingUpDOServer.sh
    This is used for configuring the DigitalOcean server

NodeJs scripts

  • getting_repos_v2.js
    filepath: scripts/nodejs/getting_repos_v2.js
    This script fetches the basic info of repos having more than 100 stars using the GitHub REST API

  • githubGraphQLApiCallsDO_V2.js
    filepath: scripts/nodejs/githubGraphQLApiCallsDO_V2.js
    This script fetches the complete info of the repositories collected by the script above, using the GitHub GraphQL API. It issues requests at a fixed rate defined in the env file (e.g. 730 ms per request)

  • githubGraphQLApiCallsDO_V3.js
    filepath: scripts/nodejs/githubGraphQLApiCallsDO_V3.js
    This script fetches the complete info of the repositories collected by the first script, also using the GitHub GraphQL API. It requests data for the next repository only after the response to the previous request has arrived
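The two fetching strategies above can be sketched as follows (in Python for consistency with the rest of this document, although the real scripts are Node.js); `fetch`, the repo list, and the interval value are stand-ins:

```python
import time

def fetch_fixed_rate(repos, fetch, interval=0.73):
    """Strategy 1 (V2): issue one request every `interval` seconds,
    regardless of when responses come back."""
    results = []
    for repo in repos:
        start = time.monotonic()
        results.append(fetch(repo))  # stand-in for the GraphQL call
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, interval - elapsed))  # pad out to the fixed cadence
    return results

def fetch_sequential(repos, fetch):
    """Strategy 2 (V3): request the next repo only after the previous
    response has been received."""
    return [fetch(repo) for repo in repos]
```

The fixed-rate approach keeps request throughput predictable against API rate limits, while the sequential approach adapts automatically to slow responses.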

Python scripts

  • json_to_csv.py
    filepath: scripts/python/json_to_csv.py
    This script converts the JSON data fetched from GitHub's GraphQL API by the scripts above into an equivalent CSV file.

  • merge.py
    filepath: scripts/python/merge.py
    This script merges the data from multiple CSV files into a single CSV file
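A hedged sketch of what these two scripts do, assuming pandas and a flat list-of-records JSON layout (the real field handling lives in the scripts themselves):

```python
import json
import pandas as pd

def json_to_csv(json_path, csv_path):
    # Flatten a list of repo records fetched from the GraphQL API into one CSV.
    with open(json_path) as f:
        records = json.load(f)
    pd.json_normalize(records).to_csv(csv_path, index=False)

def merge_csvs(csv_paths, out_path):
    # Concatenate the per-server CSV chunks into a single dataset file.
    merged = pd.concat([pd.read_csv(p) for p in csv_paths], ignore_index=True)
    merged.to_csv(out_path, index=False)
```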

Jupyter Notebooks

  • VisualizePreprocess.ipynb
    filepath: notebooks/VisualizePreprocess.ipynb
    This notebook performs the feature engineering: it visualizes the data and, based on that, creates new features, modifies existing features, and removes redundant ones. For details on the features created, check the summary below

  • training_models.ipynb
    filepath: notebooks/training_models.ipynb
    In this notebook, we trained different models with hyperparameter tuning on our dataset and compared their results at the end. For details on the models trained, their prediction scores, etc., check the summary below.
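As a hedged illustration of what the notebook does (the actual features, models, and search grids live in training_models.ipynb), hyperparameter tuning for a gradient boosting regressor might look like this in scikit-learn, run here on synthetic stand-in data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the repo feature matrix and (log-)star targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 3 + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},  # illustrative grid
    cv=3,
    scoring="r2",
)
grid.fit(X_tr, y_tr)
test_r2 = grid.score(X_te, y_te)  # held-out R^2 of the best model
```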

Summary

In this project we tried to predict the number of stars of GitHub repositories that have more than 100 stars. We collected the repository data from the GitHub REST API and GraphQL API. After generating the dataset, we visualized it and did some feature engineering, and finally applied various models and measured their scores on the training and test data.

Feature Engineering

There are a total of 49 features before pre-processing. After pre-processing (adding new features, removing redundant features, and modifying existing features) the count rises to 54. All the features are listed below. Some of the post-processing features may not be self-explanatory; please refer to the VisualizePreprocess.ipynb notebook for details.

Original Features

branches commits createdAt
description diskUsage followers
following forkCount gistComments
gistStar gists hasWikiEnabled
iClosedComments iClosedParticipants iOpenComments
iOpenParticipants isArchived issuesClosed
issuesOpen license location
login members organizations
prClosed prClosedComments prClosedCommits
prMerged prMergedComments prMergedCommits
prOpen prOpenComments prOpenCommits
primaryLanguage pushedAt readmeCharCount
readmeLinkCount readmeSize readmeWordCount
releases reponame repositories
siteAdmin stars subscribersCount
tags type updatedAt
websiteUrl

Features after pre-processing

branches commits createdAt
diskUsage followers following
forkCount gistComments gistStar
gists hasWikiEnabled iClosedComments
iClosedParticipants iOpenComments iOpenParticipants
issuesClosed issuesOpen members
organizations prClosed prClosedComments
prClosedCommits prMerged prMergedComments
prMergedCommits prOpen prOpenComments
prOpenCommits pushedAt readmeCharCount
readmeLinkCount readmeSize readmeWordCount
releases repositories subscribersCount
tags type updatedAt
websiteUrl desWordCount desCharCount
mit_license nan_license apache_license
other_license remain_license JavaScript
Python Java Objective
Ruby PHP other_language
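A minimal sketch, assuming pandas and raw columns named `description` and `license`, of how some of the derived columns above (desWordCount, desCharCount, and the license one-hots) could be built; the bucketing rules here are assumptions, and the real logic lives in VisualizePreprocess.ipynb:

```python
import pandas as pd

def bucket_license(val):
    # Map a raw license string to the one-hot buckets in the table above
    # (assumed mapping).
    if pd.isna(val):
        return "nan_license"
    low = str(val).lower()
    if "mit" in low:
        return "mit_license"
    if "apache" in low:
        return "apache_license"
    if "other" in low:
        return "other_license"
    return "remain_license"

def engineer_features(df):
    out = df.copy()
    desc = out["description"].fillna("")
    out["desCharCount"] = desc.str.len()
    out["desWordCount"] = desc.str.split().str.len()
    # One-hot encode the bucketed license, then drop the raw text columns.
    out = pd.concat([out, pd.get_dummies(out["license"].map(bucket_license))], axis=1)
    return out.drop(columns=["description", "license"])
```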

Models Trained

  • Gradient Boost Regressor
  • Cat Boost Regressor
  • Random Forest Regressor
  • Deep Neural Network

Evaluation Metrics

  • R^2 score
    (image: R^2 score formula)
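The R^2 score (coefficient of determination) is defined as R^2 = 1 - SS_res / SS_tot, the fraction of the target's variance explained by the model. A minimal, dependency-free sketch:

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot.

    SS_res is the sum of squared residuals; SS_tot is the total sum of
    squares around the mean of y_true. Predicting the mean gives 0;
    a perfect fit gives 1.
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```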

Results

(image: result bar graph comparing the different models)
