D-Lab's 6-hour introduction to machine learning in Python. Learn how to perform classification, regression, clustering, and model selection using scikit-learn and TPOT.

Stars: ✭ 46 (-19.3%)

Mutual labels: regression, classification

time-series-classification

Classifying time series using feature extraction

Stars: ✭ 75 (+31.58%)

Mutual labels: time-series, classification

Machine-Learning-Specialization

Project work and assignments for the Machine Learning Specialization course on Coursera by the University of Washington

Stars: ✭ 27 (-52.63%)

Mutual labels: regression, classification

Predictive-Maintenance-of-Aircraft-Engine

In this project I aim to apply various predictive maintenance techniques to accurately predict the impending failure of an aircraft turbofan engine.

Stars: ✭ 48 (-15.79%)

Mutual labels: regression, classification

Jhtalib

Technical Analysis Library Time-Series

Stars: ✭ 131 (+129.82%)

Mutual labels: data, time-series

Python-Machine-Learning

Python Machine Learning Algorithms

Stars: ✭ 80 (+40.35%)

Mutual labels: regression, classification

projection-pursuit

An implementation of multivariate projection pursuit regression and univariate classification

Stars: ✭ 24 (-57.89%)

Mutual labels: regression, classification

wymlp

tiny fast portable real-time deep neural network for regression and classification within 50 LOC.

Stars: ✭ 36 (-36.84%)

Mutual labels: regression, classification

R-Machine-Learning

D-Lab's 6-hour introduction to machine learning in R. Learn the fundamentals of machine learning, regression, and classification using tidymodels in R.

Stars: ✭ 27 (-52.63%)

Mutual labels: regression, classification

DataScience ArtificialIntelligence Utils

Examples of Data Science projects and Artificial Intelligence use cases

Stars: ✭ 302 (+429.82%)

Mutual labels: time-series, regression

ugtm

ugtm: a Python package for Generative Topographic Mapping

Stars: ✭ 34 (-40.35%)

Mutual labels: regression, classification

Data science blogs

A repository to keep track of all the code that I end up writing for my blog posts.

Stars: ✭ 139 (+143.86%)

Mutual labels: data, time-series

InstantDL

InstantDL: An easy and convenient deep learning pipeline for image segmentation and classification

Stars: ✭ 33 (-42.11%)

Mutual labels: regression, classification

pywedge

Makes Interactive Chart Widget, Cleans raw data, Runs baseline models, Interactive hyperparameter tuning & tracking

Stars: ✭ 49 (-14.04%)

Mutual labels: regression, classification

Pycm

Multi-class confusion matrix library in Python

Stars: ✭ 1,076 (+1787.72%)

Mutual labels: data, classification

Covid19

JSON time-series of coronavirus cases (confirmed, deaths and recovered) per country - updated daily

Stars: ✭ 1,177 (+1964.91%)

Mutual labels: data, time-series


Synthetic-data-gen

Various methods for generating synthetic data for data science and ML.

Read my article on Medium "Synthetic data generation — a must-have skill for new data scientists"

Also, a related article on generating random variables from scratch: "How to generate random variables from scratch (no library used)"

Notebooks

Why do you need the skill of synthetic data generation?

Imagine you are tinkering with a cool machine learning algorithm like SVM or a deep neural net. What kind of dataset should you practice it on? If you are learning from scratch, the advice is to start with simple, small-scale datasets that you can plot in two dimensions, so you can understand the patterns visually and see for yourself how the ML algorithm works in an intuitive fashion. For example, here is an excellent article on various datasets you can try at various levels of learning.

This is a great start. But it is not all.

Sure, you can go up a level and find yourself a large real-life dataset to practice the algorithm on. But that is still a fixed dataset, with a fixed number of samples, a fixed pattern, and a fixed degree of class separation between positive and negative samples (assuming a classification problem). Are you learning all the intricacies of the algorithm in terms of

sample complexity,
computational efficiency,
ability to handle class imbalance,
robustness of the metrics in the face of varying degrees of class separation,
bias-variance trade-off as a function of data complexity?

Probably not. Perhaps no single dataset can provide all these deep insights for a given ML algorithm. But these insights are extremely important to master if you want to become a true expert practitioner of machine learning. So you will need an extremely rich and sufficiently large dataset that is amenable to all this experimentation.

So, what can you do in this situation? Scour the internet for more datasets and just hope that some of them will bring out the limitations and challenges associated with a particular algorithm, and help you learn?

Yes, that is a possible approach, but it may not be the most viable or optimal one in terms of time and effort. Good datasets may not be clean or easily obtainable. You may spend much more time looking for, extracting, and wrangling a suitable dataset than you spend understanding the ML algorithm.

Make no mistake. The experience of searching for a real-life dataset, extracting it, running exploratory data analysis, and wrangling it into shape for machine-learning-based modeling is invaluable. I know because I wrote a book about it :-)

But that can be taught and practiced separately. In many situations, however, you may just want to have access to a flexible dataset (or several of them) to ‘teach’ you the ML algorithm in all its gory details.

Surprisingly enough, in many cases, such teaching can be done with synthetic datasets.
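As a rough illustration of the kind of experiment this makes possible, the sketch below generates a two-class dataset whose class separation and class imbalance are tunable, and then checks how a very simple classifier fares as the classes move apart. It uses only NumPy. The function names `make_two_class_blobs` and `nearest_centroid_accuracy`, and the `separation` and `minority_fraction` parameters, are hypothetical choices made for this example and are not taken from the notebooks in this repository.

```python
import numpy as np


def make_two_class_blobs(n_samples, separation, minority_fraction, seed=0):
    """Draw 2-D points from two unit-variance Gaussians.

    separation: distance between the two class means (illustrative knob).
    minority_fraction: share of the samples assigned to class 1.
    """
    rng = np.random.default_rng(seed)
    n_pos = int(round(n_samples * minority_fraction))
    n_neg = n_samples - n_pos
    # Class 0 is centred at the origin, class 1 is shifted along the x-axis.
    x_neg = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n_neg, 2))
    x_pos = rng.normal(loc=[separation, 0.0], scale=1.0, size=(n_pos, 2))
    X = np.vstack([x_neg, x_pos])
    y = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
    return X, y


def nearest_centroid_accuracy(X, y):
    """Fit one centroid per class and report accuracy on the same data."""
    centroids = np.stack([X[y == k].mean(axis=0) for k in (0, 1)])
    # Distance from every point to each centroid, shape (n_samples, 2).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return float((dists.argmin(axis=1) == y).mean())


# Sweep the separation knob: the same algorithm, an easy or a hard problem.
for sep in (0.5, 2.0, 6.0):
    X, y = make_two_class_blobs(2000, separation=sep, minority_fraction=0.2)
    print(f"separation={sep}: accuracy={nearest_centroid_accuracy(X, y):.3f}")
```

The same loop could sweep `minority_fraction` or `n_samples` instead, which is exactly the sort of controlled variation a fixed real-life dataset cannot offer.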

What is a synthetic dataset?

As the name suggests, a synthetic dataset is a repository of data that is generated programmatically rather than collected through a real-life survey or experiment. Its main purpose, therefore, is to be flexible and rich enough to help an ML practitioner conduct fascinating experiments with various classification, regression, and clustering algorithms. Desired properties are:

It can be numerical, binary, or categorical (ordinal or non-ordinal),
The number of features and the length of the dataset should be arbitrary,
It should preferably be random, and the user should be able to choose from a wide variety of statistical distributions to base the data upon, i.e., the underlying random process can be precisely controlled and tuned,
If it is used for classification algorithms, then the degree of class separation should be controllable to make the learning problem easy or hard,
Random noise can be injected in a controllable manner,
For a regression problem, a complex, non-linear generative process can be used for sourcing the data.
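Several of these properties can be sketched in a few lines of NumPy: an arbitrary number of samples and features, a user-selected feature distribution, a non-linear generative process, and independently tunable noise. This is a minimal sketch under stated assumptions, not the method used in this repository's notebooks. The function name `make_nonlinear_regression`, its parameters, and the particular sine-plus-quadratic signal are all hypothetical choices made for illustration.

```python
import numpy as np


def make_nonlinear_regression(n_samples, n_features, noise_std,
                              feature_dist="normal", seed=0):
    """Return (X, y, signal), where y = signal + Gaussian noise."""
    rng = np.random.default_rng(seed)
    # The caller picks the distribution the features are drawn from.
    samplers = {
        "normal": lambda size: rng.normal(0.0, 1.0, size),
        "uniform": lambda size: rng.uniform(-2.0, 2.0, size),
        "exponential": lambda size: rng.exponential(1.0, size),
    }
    if feature_dist not in samplers:
        raise ValueError(f"unknown feature_dist: {feature_dist!r}")
    X = samplers[feature_dist]((n_samples, n_features))
    # Illustrative non-linear generative process: a sine of the first
    # feature plus a quadratic term summed over the remaining features.
    signal = np.sin(2.0 * X[:, 0]) + 0.5 * (X[:, 1:] ** 2).sum(axis=1)
    # Noise is added separately so its level can be tuned on its own.
    y = signal + rng.normal(0.0, noise_std, n_samples)
    return X, y, signal


X, y, signal = make_nonlinear_regression(500, 3, noise_std=0.3,
                                         feature_dist="uniform")
print(X.shape, y.shape)
print("empirical noise std:", round(float(np.std(y - signal)), 3))
```

Returning the noise-free `signal` alongside `y` is a design choice for experimentation: it lets you compare a fitted model against the true generative process, which is never available for a real-life dataset.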


tirthajyoti / Synthetic-data-gen
