
tirthajyoti / Synthetic-data-gen

Licence: MIT License
Various methods for generating synthetic data for data science and ML

Programming languages: Jupyter Notebook, Python


Synthetic-data-gen

Various methods for generating synthetic data for data science and ML.

Read my article on Medium "Synthetic data generation — a must-have skill for new data scientists"

Also, a related article on generating random variables from scratch: "How to generate random variables from scratch (no library used)"


Notebooks

Why do you need the skill of synthetic data generation?

Imagine you are tinkering with a cool machine learning algorithm like SVM or a deep neural net. What kind of dataset should you practice on? If you are learning from scratch, the advice is to start with simple, small-scale datasets that you can plot in two dimensions, so you can understand the patterns visually and see for yourself how the ML algorithm works. For example, here is an excellent article on various datasets you can try at various levels of learning.

This is a great start. But it is not all.

Sure, you can go up a level and find yourself a real-life large dataset to practice the algorithm on. But that is still a fixed dataset, with a fixed number of samples, a fixed pattern, and a fixed degree of class separation between positive and negative samples (if we assume it to be a classification problem). Are you learning all the intricacies of the algorithm in terms of

  • sample complexity,
  • computational efficiency,
  • ability to handle class imbalance,
  • robustness of the metrics in the face of varying degrees of class separation, and
  • bias-variance trade-off as a function of data complexity?

Probably not. Perhaps no single dataset can yield all these deep insights for a given ML algorithm. But these are extremely important insights to master if you want to become a true expert practitioner of machine learning. So, you will need an extremely rich and sufficiently large dataset, one that is amenable to all this experimentation.

So, what can you do in this situation? Scour the internet for more datasets and just hope that some of them will bring out the limitations and challenges associated with a particular algorithm and help you learn?

Yes, it is a possible approach, but it may not be the most viable or optimal one in terms of time and effort. Good datasets may not be clean or easily obtainable. You may spend much more time looking for, extracting, and wrangling a suitable dataset than understanding the ML algorithm.

Make no mistake. The experience of searching for a real-life dataset, extracting it, running exploratory data analysis, and wrangling it into a form suitable for machine-learning-based modeling is invaluable. I know because I wrote a book about it :-)

But that can be taught and practiced separately. In many situations, however, you may just want to have access to a flexible dataset (or several of them) to ‘teach’ you the ML algorithm in all its gory details.

Surprisingly enough, in many cases, such teaching can be done with synthetic datasets.
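As a rough illustration of the kind of experiment this makes possible, here is a minimal sketch that uses only Python's standard library. The function names (`sample_two_classes`, `threshold_accuracy`) and all parameter values are made up for this example and are not taken from the notebooks in this repository. It draws two 1-D Gaussian classes, varies how far apart they sit, and then shows how plain accuracy can mislead on an imbalanced dataset.

```python
import random


def sample_two_classes(n_samples, class_sep, pos_fraction=0.5, seed=0):
    """Return (x, label) pairs drawn from two unit-variance 1-D Gaussians.

    Class 1 is centred at +class_sep / 2 and class 0 at -class_sep / 2.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        label = 1 if rng.random() < pos_fraction else 0
        centre = class_sep / 2 if label == 1 else -class_sep / 2
        data.append((rng.gauss(centre, 1.0), label))
    return data


def threshold_accuracy(data, threshold=0.0):
    """Accuracy of the fixed rule 'predict class 1 when x > threshold'."""
    correct = sum((x > threshold) == (label == 1) for x, label in data)
    return correct / len(data)


# Experiment 1: the same classifier on hard vs. easy versions of the problem.
separations = [0.5, 1.0, 2.0, 4.0]
accuracies = [threshold_accuracy(sample_two_classes(5000, sep)) for sep in separations]
for sep, acc in zip(separations, accuracies):
    print(f"class_sep={sep:<4} accuracy={acc:.3f}")

# Experiment 2: with 5% positives, always predicting class 0 still looks accurate.
imbalanced = sample_two_classes(5000, class_sep=1.0, pos_fraction=0.05)
majority_baseline = sum(label == 0 for _, label in imbalanced) / len(imbalanced)
print(f"always-negative accuracy on imbalanced data: {majority_baseline:.3f}")
```

With a fixed real-life dataset you get one point on each of these curves. With a generator you can sweep the knob yourself and watch the metric respond.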

What is a synthetic dataset?

As the name suggests, a synthetic dataset is a repository of data that is generated programmatically. So, it is not collected by any real-life survey or experiment. Its main purpose, therefore, is to be flexible and rich enough to help an ML practitioner conduct fascinating experiments with various classification, regression, and clustering algorithms. Desired properties are:

  • It can be numerical, binary, or categorical (ordinal or non-ordinal).
  • The number of features and the length of the dataset should be arbitrary.
  • It should preferably be random, and the user should be able to choose from a wide variety of statistical distributions to base this data upon, i.e. the underlying random process can be precisely controlled and tuned.
  • If it is used for classification algorithms, the degree of class separation should be controllable to make the learning problem easy or hard.
  • Random noise can be injected in a controllable manner.
  • For a regression problem, a complex, non-linear generative process can be used for sourcing the data.
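
A few of these properties can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from the notebooks: the helper names `make_classification_data` and `make_regression_data`, the Gaussian class model, and the target function `sin(2x) + 0.5x²` are all invented for this example. Categorical features and other distributions are left out to keep it short.

```python
import math
import random


def make_classification_data(n_samples, n_features, class_sep=1.0, flip_prob=0.0, seed=0):
    """Two Gaussian classes with arbitrary size, controllable separation and label noise.

    Class means sit at -class_sep / 2 and +class_sep / 2 on every feature.
    Each label is flipped with probability flip_prob (hypothetical noise model).
    """
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n_samples):
        label = rng.randint(0, 1)
        centre = (label - 0.5) * class_sep
        rows.append([rng.gauss(centre, 1.0) for _ in range(n_features)])
        if rng.random() < flip_prob:
            label = 1 - label  # inject label noise
        labels.append(label)
    return rows, labels


def make_regression_data(n_samples, noise_sd=0.1, seed=0):
    """Non-linear generative process: y = sin(2x) + 0.5 * x^2 + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n_samples)]
    ys = [math.sin(2 * x) + 0.5 * x * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


# Usage: an easy, slightly noisy classification set and a noisy regression set.
X, y = make_classification_data(200, 3, class_sep=2.0, flip_prob=0.05, seed=42)
xs, ys = make_regression_data(100, noise_sd=0.2, seed=42)
print(len(X), len(X[0]), len(xs), len(ys))
```

In practice you would more likely reach for a library. For instance, scikit-learn's `make_classification` exposes similar knobs (`class_sep`, `flip_y`, `weights`).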