
LastAncientOne / Data-Science

Licence: MIT license
Using Kaggle Data and Real World Data for Data Science and prediction in Python, R, Excel, Power BI, and Tableau.


Projects that are alternatives of or similar to Data-Science

Kaggle Competitions
There are plenty of courses and tutorials that can help you learn machine learning from scratch but here in GitHub, I want to solve some Kaggle competitions as a comprehensive workflow with python packages. After reading, you can use this workflow to solve other real problems and use it as a template.
Stars: ✭ 86 (+473.33%)
Mutual labels:  exploratory-data-analysis, kaggle, feature-engineering
Bike-Sharing-Demand-Kaggle
Top 5th percentile solution to the Kaggle knowledge problem - Bike Sharing Demand
Stars: ✭ 33 (+120%)
Mutual labels:  kaggle, datascience, feature-engineering
student-grade-analytics
Analyse academic and non-academic information of students and predict grades
Stars: ✭ 17 (+13.33%)
Mutual labels:  exploratory-data-analysis, datascience, exploratory-data-visualizations
How-to-score-0.8134-in-Titanic-Kaggle-Challenge
Solution of the Titanic Kaggle competition
Stars: ✭ 114 (+660%)
Mutual labels:  exploratory-data-analysis, kaggle
Deep Learning Machine Learning Stock
Stock for Deep Learning and Machine Learning
Stars: ✭ 240 (+1500%)
Mutual labels:  prediction, feature-engineering
kushner eb5 census
Jared Kushner and his partners used a program meant for job-starved areas to build a luxury skyscraper
Stars: ✭ 49 (+226.67%)
Mutual labels:  exploratory-data-analysis, exploratory-data-visualizations
learnr
Exploratory, Inferential and Predictive data analysis. Feel free to show your ❤️ by giving a star ⭐
Stars: ✭ 64 (+326.67%)
Mutual labels:  exploratory-data-analysis, inferential-statistics
adenine
ADENINE: A Data ExploratioN PipelINE
Stars: ✭ 15 (+0%)
Mutual labels:  exploratory-data-analysis, dimensionality-reduction
Code
Compilation of R and Python programming codes on the Data Professor YouTube channel.
Stars: ✭ 287 (+1813.33%)
Mutual labels:  exploratory-data-analysis, datascience
datapackage-m
Power Query M functions for working with Tabular Data Packages (Frictionless Data) in Power BI and Excel
Stars: ✭ 26 (+73.33%)
Mutual labels:  excel, powerbi
100 Days Of Ml Code
A day-to-day plan for this challenge. Covers both theoretical and practical aspects
Stars: ✭ 172 (+1046.67%)
Mutual labels:  exploratory-data-analysis, datascience
Power-Query-Excel-Formats
A collection of M code to get various formats from Excel sheets in Power Query
Stars: ✭ 43 (+186.67%)
Mutual labels:  excel, powerbi
Open Solution Toxic Comments
Open solution to the Toxic Comment Classification Challenge
Stars: ✭ 154 (+926.67%)
Mutual labels:  prediction, kaggle
Mlbox
MLBox is a powerful Automated Machine Learning python library.
Stars: ✭ 1,199 (+7893.33%)
Mutual labels:  prediction, kaggle
vlainic.github.io
My GitHub blog: things you might be interested, and probably not...
Stars: ✭ 26 (+73.33%)
Mutual labels:  prediction, datascience
MSDS696-Masters-Final-Project
Earthquake Prediction Challenge with LightGBM and XGBoost
Stars: ✭ 58 (+286.67%)
Mutual labels:  prediction, kaggle
LibPQ
Detach your M code from workbooks to reuse it! Import modules from local or web storage (unlimited number of sources)
Stars: ✭ 55 (+266.67%)
Mutual labels:  excel, powerbi
Lightautoml
LAMA - automatic model creation framework
Stars: ✭ 196 (+1206.67%)
Mutual labels:  kaggle, feature-engineering
Nyaggle
Code for Kaggle and Offline Competitions
Stars: ✭ 209 (+1293.33%)
Mutual labels:  kaggle, feature-engineering
Complete Life Cycle Of A Data Science Project
Complete-Life-Cycle-of-a-Data-Science-Project
Stars: ✭ 140 (+833.33%)
Mutual labels:  exploratory-data-analysis, feature-engineering


Programming Language and Software    Software Links
Data Science in Python               Python
Data Science in R                    R
Data Science in Excel                Excel
Data Science in Power BI             Power BI
Data Science in Tableau              Tableau

Data Science

Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data, both quantitative and qualitative, and to apply that knowledge and those actionable insights across a broad range of application domains. (Wikipedia)

This repository is for practicing programming skills and applying knowledge of mathematics and statistics to extract meaningful insights from structured and unstructured data (Kaggle data and real-world data). It takes a step-by-step approach to learning data science: analytical techniques, statistics, and research methods. The most commonly used methods in data science are Regression, Clustering, Visualization, Decision Trees/Rules, and Random Forest. The material covers the process of analyzing data in Python, R, Excel, Power BI, and Tableau. It also aims to help you become a data scientist and expand your knowledge of machine learning and deep learning. The overall goal is understanding and analyzing data.

Completed Staff Work (CSW)

Completed Staff Work is similar to data analysis. Completed Staff Work enables decision makers to find solutions to problems or address issues after consideration of reasonable, workable, carefully considered alternatives.

7 Steps to CSW

1. Identify, describe, or define the problems.

2. Gather or compile information about the problem.

3. Organize information for review & consideration.

4. Analyze or evaluate the information.

5. Develop, compile or generate alternatives.

6. Select or identify the solution you want to recommend based on the results of your objective analysis.

7. Develop a plan to implement the solution and the documents necessary to authorize the implementation.

Prerequisites

Python 3.5+

R 3.5.3+

Excel 2016+

Power BI

Tableau

🔷 Getting Started with Data Science 🔷

🔵 Step-by-Step to Data Science

  • Define Problem
  • Data Collection
  • Data Understanding
  • Data Analysis/Cleaning
  • Data Organization/Transformation
  • Data Validation/Anomaly Detection
  • Feature Engineering
  • Model Training
  • Model Evaluation/Validation
  • Model Deployment
  • Model Monitoring
  • Data Drift/Model Drift
  • Reports

🔵 Three Types of Positions in Data Science

  • Data Engineer
    • Develops, constructs, tests, and maintains architectures such as databases and large-scale processing systems.
  • Data Analyst
    • Interprets data and turns it into information that can offer ways to improve a business.
    • Gathers information from various sources and interprets patterns and trends.
  • Machine Learning Scientist
    • Researches and develops algorithms.
    • Makes predictions from data with labels and features.
    • Creates predictive models.

🔵 Types of Data Analysis: Techniques and Methods

  • Descriptive Analysis
  • Text Analysis
  • Statistical Analysis
  • Diagnostic Analysis
  • Predictive Analysis
  • Prescriptive Analysis

🔵 Two Types of Data

  • Supervised Data (Data is labeled with a category or a numerical target)
    • Classification (Predict a category)
    • Regression (Predict a number)
  • Unsupervised Data (Data is not labeled in any way)
    • Clustering (Divide by similarity)
    • Dimensionality Reduction (Generalization) - Find hidden dependencies
    • Association (Identify Sequences)
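
The difference between the two kinds of data can be sketched in a few lines of plain Python. This is only an illustration on made-up numbers: `nearest_centroid_predict` and `two_means` are hypothetical helper names, and the "small"/"large" labels are invented for the example. The first function needs labels to learn from (supervised classification); the second receives only raw values and groups them by similarity (unsupervised clustering).

```python
def nearest_centroid_predict(train, x):
    """Supervised: classify x by the closest class mean learned from labeled data."""
    sums = {}
    for value, label in train:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + value, count + 1)
    centroids = {label: total / count for label, (total, count) in sums.items()}
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def two_means(values, iterations=10):
    """Unsupervised: split unlabeled 1-D values into two clusters (k-means, k=2)."""
    low, high = min(values), max(values)  # initial cluster centres
    for _ in range(iterations):
        cluster_a = [v for v in values if abs(v - low) <= abs(v - high)]
        cluster_b = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(cluster_a) / len(cluster_a)
        if cluster_b:
            high = sum(cluster_b) / len(cluster_b)
    return sorted(cluster_a), sorted(cluster_b)


# Made-up data: the same values, once with labels and once without.
labeled = [(1.0, "small"), (1.5, "small"), (2.0, "small"),
           (8.0, "large"), (9.0, "large"), (10.0, "large")]
unlabeled = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]

prediction = nearest_centroid_predict(labeled, 2.5)
clusters = two_means(unlabeled)
print(prediction)
print(clusters)
```

In practice a library such as scikit-learn would normally be used for both tasks; the hand-written versions are only meant to show where the labels enter.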

🔵 Learning about Exploratory Data Analysis

  • Import, read, clean, and validate

    • Define Variables
      1. Y is "Dependent Variable" and goes on y-axis (the left side, vertical one) - output value
      2. X is "Independent Variable" and goes on the x-axis (the bottom, horizontal one) - input value
    • Type of Data
      1. Quantitative
        • Ratio or Interval
          • Discrete and Continuous
            Discrete variables can only take certain numerical values and are counted.
            Continuous variables can take any numerical value within a range and are measured.
      2. Qualitative
        • Nominal or Ordinal
          • Binary, nominal, and ordinal data
            Categorical variables take category or label values and place an individual into one of several groups.
    • Type of data measurements
      1. Nominal - names or labels with no inherent order
        For example, gender: male and female. Other examples include eye colour and hair colour.
      2. Ordinal - ordered categories, often for non-numeric concepts like satisfaction, happiness, or discomfort, where the order matters but the gaps between values are not known to be equal
        For example: rating happiness on a scale of 1-10.
      3. Interval - numeric scales in which we know both the order and the exact differences between the values, but which have no true zero
        For example: temperature in degrees Celsius or Fahrenheit. The difference in temperature between 10 and 20 degrees is the same as the difference between 20 and 30 degrees.
        A Likert scale is a rating scale, often found on survey forms, that measures how people feel about something. It is composed of a series of four or more Likert-type items that represent similar questions, combined into a single composite score/variable. Each item ideally offers 5-7 balanced responses to choose from, often with a neutral midpoint. Likert scale data can be analyzed as interval data, in which case the mean is used as the measure of central tendency and means and standard deviations describe the scale.
      4. Ratio - numeric scales with equal intervals and a true zero
        For example: height measured in centimetres, metres, inches, or feet. Because ratio data has a true zero, negative values are not possible.
  • Visualize distributions

    • Univariate visualization
    • Bivariate visualization
    • Multivariate visualization
    • Dimensionality reduction
  • Explore relations between variables

    • Descriptive statistics
    • Inferential statistics
    • Statistical graphics
  • Explore multivariate relationships

  • Statistical Analysis

    • Cases, Variables, Types of Variables
    • Matrix and Frequency Table
    • Graphs and Shapes of Distributions
    • Mode, Median and Mean
    • Range, Interquartile Range and Box Plot
    • Variance and Standard deviation
    • Z-scores
    • Contingency Table, Scatterplot, Pearson’s r
    • Basics of Regression
    • Elementary Probability
    • Random Variables and Probability Distributions
    • Normal Distribution, Binomial Distribution & Poisson Distribution
    • Hypothesis
      3 Steps:
      (1) Making an initial assumption.
      (2) Collecting evidence (data).
      (3) Based on the available evidence (data), deciding whether to reject or not reject the initial assumption.
  • Inferential Statistics

    • Observational Studies and Experiments
    • Sample and Population
    • Population Distribution, Sample Distribution and Sampling Distribution
    • Central Limit Theorem
    • Point Estimates
    • Confidence Intervals
    • Introduction to Hypothesis Testing
  • Questions about data

    • Do you have the right data for exploratory data analysis?
    • Do you need other data?
    • Do you have the right question?
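
Several of the statistical topics listed above (mode, median and mean, range, standard deviation, z-scores, Pearson's r, and confidence intervals) can be tried out with Python's standard library alone. The sketch below uses made-up height and weight samples, and `pearson_r` is a hypothetical helper written for this example. The confidence interval uses the normal critical value 1.96, which is only a rough approximation for a sample this small.

```python
import math
import statistics

# Made-up samples for illustration only.
heights_cm = [150, 160, 160, 165, 170, 175, 180]
weights_kg = [50, 56, 58, 61, 66, 70, 75]

# Measures of central tendency and spread.
mean = statistics.mean(heights_cm)
median = statistics.median(heights_cm)
mode = statistics.mode(heights_cm)
value_range = max(heights_cm) - min(heights_cm)
stdev = statistics.stdev(heights_cm)  # sample standard deviation

# Z-score: how many standard deviations each value lies from the mean.
z_scores = [(h - mean) / stdev for h in heights_cm]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


r = pearson_r(heights_cm, weights_kg)

# Approximate 95% confidence interval for the mean (normal approximation).
margin = 1.96 * stdev / math.sqrt(len(heights_cm))
confidence_interval = (mean - margin, mean + margin)

print(f"mean={mean:.2f} median={median} mode={mode} range={value_range}")
print(f"stdev={stdev:.2f} r={r:.3f}")
print(f"95% CI for the mean: ({confidence_interval[0]:.2f}, {confidence_interval[1]:.2f})")
```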

🔵 Learning to be a Data Scientist

  • Choose Programming Language
    • Python or R
  • Mathematics and Linear Algebra
  • Big Data
  • Data Visualization
  • Data Cleaning
  • How to Solve Problems
  • Machine Learning
    • Types of learning algorithms
    1. Supervised Learning
      • Dataset has labels
      • Classification
        • Binary Classification
        • Multiclass Classification
        • Multilabel Classification
      • Regression
        • Linear Regression: Linear relationships between inputs and outputs
        • Logistic Regression: Probability of a binary output (despite its name, it is typically used for classification)
    2. Unsupervised Learning
      • Dataset is unlabeled
    3. Semi-supervised Learning
      • Dataset contains both labeled and unlabeled examples
    4. Reinforcement Learning
      • Learns from mistakes (trial and error)
      • An agent takes "actions" in an environment and observes the "state" of the environment through its features
      • Executing different actions in each state brings different "rewards"
      • It learns a "policy": which action to take in each state.
  • Common Machine Learning Algorithms
  1. Linear Regression
  2. Logistic Regression
  3. Decision Tree
  4. SVM
  5. Naive Bayes
  6. kNN
  7. K-Means
  8. Random Forest
  9. Dimensionality Reduction Algorithms
  10. Gradient Boosting algorithms
  • Deep Learning
    • Common Libraries
    1. TensorFlow
    2. Keras
    3. Theano
    4. PyTorch
    5. scikit-learn
    6. Caffe
    7. Apache Spark
    8. Chainer
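
To make the first two algorithms in the list above concrete, here is a minimal sketch in plain Python. It assumes a single input feature and made-up data; `fit_line` and `sigmoid` are hypothetical helper names, and the `weight` and `bias` values for the logistic part are chosen by hand rather than fitted. Linear regression finds the straight line that best fits the points, while logistic regression passes a linear score through the sigmoid function to obtain a probability between 0 and 1.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Linear regression on made-up points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)

# Logistic regression prediction step with hand-picked (not fitted) parameters.
weight, bias = 1.5, -3.0
probability = sigmoid(weight * 2.0 + bias)  # linear score is 0 for x = 2.0
print(probability)
```

A score of zero corresponds to a probability of 0.5, which is the usual decision boundary between the two classes.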

🔵 Underfitting and Overfitting

Overfitting

  1. Overfitting - the gap between training error and test error is large.
  2. Overfitting - the training error is much smaller than the test error.
  3. Overfitting - the larger the hypothesis space, the higher the tendency for the model to overfit the training dataset.
  4. A model suffering from overfitting will have high variance and low bias.

Fixing Overfitting

  1. Simplify the model (fewer parameters)
  2. Simplify training data (fewer attributes)
  3. Constrain the model (regularization)
  4. Use cross-validation
  5. Use early stopping
  6. Build an ensemble
  7. Gather more data
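
Cross-validation from the list above can be sketched as follows. This is a minimal, unshuffled k-fold split written in plain Python; `k_fold_indices` is a hypothetical helper, not a library function. Each sample lands in the validation set exactly once, so the model is always scored on data it was not trained on.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Real projects usually shuffle the data before splitting, or use a stratified split for classification; both are left out here to keep the sketch short.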

Underfitting

  1. Underfitting - both the training and test errors are large.
  2. A model suffering from underfitting will have high bias and low variance.
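
Both failure modes can be observed on a toy problem. The sketch below is an illustration on made-up data, not a benchmark: it uses a hand-written k-nearest-neighbours regressor (`knn_predict` and `mean_squared_error` are hypothetical helpers). With k=1 the model memorizes the noisy training points, so the training error is zero while the test error is not (overfitting). With k equal to the number of training points it always predicts the overall mean, so both errors are large (underfitting).

```python
def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def mean_squared_error(train_x, train_y, xs, ys, k):
    """Mean squared error of the k-NN regressor on the points (xs, ys)."""
    squared = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared) / len(squared)


# Made-up data: the true relationship is y = 2x. The training targets carry
# alternating +1 / -1 noise; the test targets are noise-free.
train_x = list(range(10))
train_y = [2 * x + (1 if x % 2 == 0 else -1) for x in train_x]
test_x = [i + 0.4 for i in range(9)]
test_y = [2 * x for x in test_x]

errors_by_k = {}
for k in (1, len(train_x)):
    train_error = mean_squared_error(train_x, train_y, train_x, train_y, k)
    test_error = mean_squared_error(train_x, train_y, test_x, test_y, k)
    errors_by_k[k] = (train_error, test_error)
    print(f"k={k}: train MSE={train_error:.2f}, test MSE={test_error:.2f}")
```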

Fixing Underfitting

  1. Increase model complexity (more parameters)
  2. Increase number of features
  3. Apply feature engineering
  4. Un-constrain the model (no regularization)
  5. Reduce or remove noise on the data
  6. Train for longer

🔵 Learning to improve the Model or Prediction

  • Improve the "Accuracy" of a Machine Learning Model
  1. Add More Data
  2. Add More Features
  3. Feature Engineering
  4. Feature Selection
  5. Use Regularization
  6. Try Multiple Algorithms
  7. Ensemble Methods
  8. Cross Validation
  9. Algorithm Tuning
  10. Bagging or Boosting
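
As one example from the list above, regularization can be illustrated with ridge regression in its simplest form. The sketch assumes a single feature and a model without an intercept, y ≈ slope * x, and uses made-up data; `ridge_slope` is a hypothetical helper. Minimizing the squared error plus `penalty * slope**2` gives the closed-form solution used below, and a larger penalty shrinks the slope toward zero, which constrains the model.

```python
def ridge_slope(xs, ys, penalty):
    """Slope of y = slope * x that minimizes squared error + penalty * slope**2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


# Made-up points that lie exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

slopes = {penalty: ridge_slope(xs, ys, penalty) for penalty in (0.0, 1.0, 14.0)}
for penalty, slope in slopes.items():
    print(f"penalty={penalty}: slope={slope:.3f}")
```

With a penalty of zero this reduces to ordinary least squares. The right penalty is normally chosen by cross-validation rather than by hand.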

Author:

Tin Hang
