PlayingNumbers / Ds_salary_proj

Repo for the data science salary prediction project from the Data Science Project From Scratch video on my YouTube channel

Projects that are alternatives of or similar to Ds salary proj

Fusion360gallerydataset
Data, tools, and documentation of the Fusion 360 Gallery Dataset
Stars: ✭ 118 (+1.72%)
Mutual labels:  jupyter-notebook
Bitcoin Price Prediction Using Sentiment Analysis
Predicts real-time bitcoin price using twitter and reddit sentiment, and sends out notifications via SMS.
Stars: ✭ 118 (+1.72%)
Mutual labels:  jupyter-notebook
Planet Amazon Deforestation
The open source repository for the Kaggle Amazon forest devastation competition https://www.kaggle.com/c/planet-understanding-the-amazon-from-space
Stars: ✭ 118 (+1.72%)
Mutual labels:  jupyter-notebook
Neural Painters Pytorch
PyTorch library for "Neural Painters: A learned differentiable constraint for generating brushstroke paintings"
Stars: ✭ 118 (+1.72%)
Mutual labels:  jupyter-notebook
Teach Me Quantum
⚛ 10 week Practical Course on Quantum Information Science and Quantum Computing - with Qiskit and IBMQX
Stars: ✭ 118 (+1.72%)
Mutual labels:  jupyter-notebook
Reinforcementlearning Atarigame
PyTorch LSTM RNN for reinforcement learning to play Atari games from OpenAI Universe. We also use Google DeepMind's Asynchronous Advantage Actor-Critic (A3C) algorithm, which is far superior to and more efficient than DQN and makes it obsolete. It can play many games.
Stars: ✭ 118 (+1.72%)
Mutual labels:  jupyter-notebook
Machinelearninginjulia2020
Resources for a 3.5 hour workshop on machine learning using the MLJ toolbox
Stars: ✭ 118 (+1.72%)
Mutual labels:  jupyter-notebook
Senato.py
A scraper for the data made available by the Italian Senate, and a cluster analysis to detect similar amendments.
Stars: ✭ 118 (+1.72%)
Mutual labels:  jupyter-notebook
Pandas
pandas cheatsheet
Stars: ✭ 118 (+1.72%)
Mutual labels:  jupyter-notebook
Qiskit Tutorials
A collection of Jupyter notebooks showing how to use the Qiskit SDK
Stars: ✭ 1,777 (+1431.9%)
Mutual labels:  jupyter-notebook
Vcn
Volumetric Correspondence Networks for Optical Flow, NeurIPS 2019.
Stars: ✭ 118 (+1.72%)
Mutual labels:  jupyter-notebook
Amazonsagemakercourse
SageMaker Course Material
Stars: ✭ 118 (+1.72%)
Mutual labels:  jupyter-notebook
Pytextrank
Python implementation of TextRank for phrase extraction and summarization of text documents
Stars: ✭ 1,675 (+1343.97%)
Mutual labels:  jupyter-notebook
Reinvent2019 Aim362 Sagemaker Debugger Model Monitor
Build, train & debug, and deploy & monitor with Amazon SageMaker
Stars: ✭ 118 (+1.72%)
Mutual labels:  jupyter-notebook
Ysda deeplearning17
Yandex SDA classes on deep learning. Version of year 2017
Stars: ✭ 118 (+1.72%)
Mutual labels:  jupyter-notebook
Synapse
Samples for Azure Synapse Analytics
Stars: ✭ 115 (-0.86%)
Mutual labels:  jupyter-notebook
Vae Tensorflow
A Tensorflow implementation of a Variational Autoencoder for the deep learning course at the University of Southern California (USC).
Stars: ✭ 117 (+0.86%)
Mutual labels:  jupyter-notebook
Midi Dataset
Code for creating a dataset of MIDI ground truth
Stars: ✭ 118 (+1.72%)
Mutual labels:  jupyter-notebook
Tensorflow shiny
A R/Shiny app for interactive RNN tensorflow models
Stars: ✭ 118 (+1.72%)
Mutual labels:  jupyter-notebook
Statistical Learning Method
Notes on "Statistical Learning Methods" (《统计学习方法》), with Python implementations of the algorithms
Stars: ✭ 1,643 (+1316.38%)
Mutual labels:  jupyter-notebook

Data Science Salary Estimator: Project Overview

  • Created a tool that estimates data science salaries (MAE ~ $11K) to help data scientists negotiate their income when they get a job.
  • Scraped over 1000 job descriptions from Glassdoor using Python and Selenium.
  • Engineered features from the text of each job description to quantify the value companies put on Python, Excel, AWS, and Spark.
  • Optimized Linear, Lasso, and Random Forest Regressors using GridSearchCV to reach the best model.
  • Built a client-facing API using Flask.

Code and Resources Used

Python Version: 3.7
Packages: pandas, numpy, sklearn, matplotlib, seaborn, selenium, flask, json, pickle
For Web Framework Requirements: pip install -r requirements.txt
Scraper Github: https://github.com/arapfaik/scraping-glassdoor-selenium
Scraper Article: https://towardsdatascience.com/selenium-tutorial-scraping-glassdoor-com-in-10-minutes-3d0915c6d905
Flask Productionization: https://towardsdatascience.com/productionize-a-machine-learning-model-with-flask-and-heroku-8201260503d2

YouTube Project Walk-Through

https://www.youtube.com/playlist?list=PL2zq7klxX5ASFejJj80ob9ZAnBHdz5O1t

Web Scraping

I tweaked the web scraper GitHub repo (above) to scrape 1000 job postings from glassdoor.com. For each job, I collected the following:

  • Job title
  • Salary Estimate
  • Job Description
  • Rating
  • Company
  • Location
  • Company Headquarters
  • Company Size
  • Company Founded Date
  • Type of Ownership
  • Industry
  • Sector
  • Revenue
  • Competitors

Data Cleaning

After scraping the data, I needed to clean it up so that it was usable for our model. I made the following changes and created the following variables:

  • Parsed numeric data out of salary
  • Made columns for employer provided salary and hourly wages
  • Removed rows without salary
  • Parsed rating out of company text
  • Made a new column for company state
  • Added a column for whether the job was at the company’s headquarters
  • Transformed founded date into age of company
  • Made columns for whether different skills were listed in the job description:
    • Python
    • R
    • Excel
    • AWS
    • Spark
  • Column for simplified job title and Seniority
  • Column for description length

EDA

I looked at the distributions of the data and the value counts for the various categorical variables. Below are a few highlights from the pivot tables.

[Images: three figures with highlights from the EDA]

Model Building

First, I transformed the categorical variables into dummy variables. I also split the data into train and test sets with a test size of 20%.

I evaluated the models using Mean Absolute Error (MAE). I chose MAE because it is relatively easy to interpret and because outliers aren’t particularly harmful for this type of model.
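
As a quick illustration of the metric, MAE is the average of the absolute differences between actual and predicted values. The numbers below are made up for the example (read them as salaries in thousands of dollars) and are not taken from the project data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative salaries in thousands of dollars
actual = [100.0, 120.0, 90.0, 150.0]
predicted = [110.0, 115.0, 95.0, 130.0]

# Absolute errors are 10, 5, 5 and 20, so the mean is 40 / 4 = 10.0
mae = mean_absolute_error(actual, predicted)
```

An MAE of 10.0 on this scale means the predictions are off by about $10K on average, which is what makes the metric easy to interpret.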

The three models were:

  • Multiple Linear Regression – Baseline for the model
  • Lasso Regression – Because of the sparse data from the many categorical variables, I thought a regularized regression like lasso would be effective.
  • Random Forest – Again, with the sparsity associated with the data, I thought that this would be a good fit.
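
The modeling steps above could look roughly like the following sketch. It runs on a small synthetic data set; the column names (`job_simp`, `python_yn`, `rating`, `avg_salary`), the parameter grids, and the random seeds are assumptions chosen for illustration and are not the project's actual settings. Linear regression is fitted directly as the baseline, while lasso and random forest are tuned with GridSearchCV scored on negative MAE.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cleaned data (hypothetical columns and values)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "job_simp": rng.choice(["analyst", "data scientist", "data engineer"], size=n),
    "python_yn": rng.integers(0, 2, size=n),
    "rating": rng.uniform(2.0, 5.0, size=n).round(1),
})
base = df["job_simp"].map({"analyst": 65, "data scientist": 115, "data engineer": 105})
df["avg_salary"] = base + 10 * df["python_yn"] + rng.normal(0, 8, size=n)

# Categorical variables -> dummy variables, then an 80/20 train/test split
X = pd.get_dummies(df.drop(columns="avg_salary"), dtype=float)
y = df["avg_salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Grid searches scored on (negative) mean absolute error
lasso_search = GridSearchCV(
    Lasso(max_iter=10_000),
    {"alpha": [0.01, 0.1, 1.0]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
rf_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    {"n_estimators": [50, 100], "max_features": ["sqrt", 1.0]},
    scoring="neg_mean_absolute_error",
    cv=3,
)

models = {
    "linear": LinearRegression().fit(X_train, y_train),
    "lasso": lasso_search.fit(X_train, y_train).best_estimator_,
    "random_forest": rf_search.fit(X_train, y_train).best_estimator_,
}

# MAE of each fitted model on the held-out test set
test_mae = {
    name: mean_absolute_error(y_test, model.predict(X_test))
    for name, model in models.items()
}
```

Because the data here is synthetic, the resulting MAE values say nothing about the real project; the sketch only shows how the pieces fit together.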

Model performance

The Random Forest model far outperformed the other approaches on the test and validation sets.

  • Random Forest: MAE = 11.22
  • Linear Regression: MAE = 18.86
  • Lasso Regression: MAE = 19.67

Productionization

In this step, I built a Flask API endpoint hosted on a local web server by following the Towards Data Science (TDS) tutorial linked under Code and Resources Used above. The endpoint takes a request containing a list of values from a job listing and returns an estimated salary.
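
A minimal sketch of such an endpoint is shown below. The route name `/predict`, the JSON keys `input` and `response`, and the `PlaceholderModel` class are assumptions for illustration; in the real project the model would be the trained regressor loaded from a pickle file rather than a placeholder.

```python
from flask import Flask, jsonify, request

class PlaceholderModel:
    """Hypothetical stand-in for the trained model: predicts the mean of the inputs."""

    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]

model = PlaceholderModel()
app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"input": [1.0, 0.0, 3.5]}
    payload = request.get_json(silent=True) or {}
    values = payload.get("input")
    is_numeric_list = (
        isinstance(values, list)
        and len(values) > 0
        and all(isinstance(v, (int, float)) for v in values)
    )
    if not is_numeric_list:
        return jsonify({"error": "expected a non-empty list of numbers under 'input'"}), 400
    prediction = model.predict([values])[0]
    return jsonify({"response": float(prediction)})

# Exercise the endpoint in-process with Flask's test client (no server needed)
with app.test_client() as client:
    reply = client.post("/predict", json={"input": [1.0, 0.0, 3.5]})
    result = reply.get_json()
    bad_reply = client.post("/predict", json={"input": "not a list"})
```

The test client makes it possible to check the request and response shapes without starting a web server; a real client would send the same JSON body to the locally hosted endpoint.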
