allentran / Pca Magic

Licence: apache-2.0
PCA that iteratively replaces missing data

pca-magic

An implementation of probabilistic principal components analysis, a variant of vanilla PCA that can be used to

  • compute factors where some of the data are missing
  • interpolate data by using information from additional series

Often, you want to use PCA but your data is riddled with missing values. See below, where white represents missing data in 14k+ time series from the Current Population Survey, a monthly survey of about 60k households conducted by the United States Census Bureau since 1940.

CPS missing data

If only a little of the data is missing, you can fill the gaps with sample means or some other interpolated value. But if too much is missing, that rudimentary interpolation will overwhelm the signal in the data with noise. (Think about the limiting case where all but one data point is missing.)
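To make the point about crude interpolation concrete, here is a small NumPy sketch on simulated data. It is not part of pca-magic: the two series, the single shared factor and the 80 percent missing rate are all assumptions chosen for illustration. It compares the correlation between two co-moving series before and after one of them is mean-filled.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two simulated series driven by one shared factor, so they co-move strongly.
n = 2000
factor = rng.normal(size=n)
x = factor + 0.1 * rng.normal(size=n)
y = factor + 0.1 * rng.normal(size=n)
corr_full = np.corrcoef(x, y)[0, 1]

# Hide 80% of y at random, then fill the gaps with the mean of what is left.
missing = rng.random(n) < 0.8
y_filled = y.copy()
y_filled[missing] = y[~missing].mean()
corr_filled = np.corrcoef(x, y_filled)[0, 1]

print(f"correlation with complete data: {corr_full:.2f}")
print(f"correlation after mean-filling: {corr_filled:.2f}")
```

The mean-filled series is constant wherever data was missing, so most of its co-movement with the other series disappears, and any factor estimated from it inherits that distortion.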

A better way: suppose you had the latent factors representing the matrix. You could fit a linear model of each series on those factors and then use the fitted model for interpolation. Intuitively, this preserves the signal in the data because the interpolated values come from the latent factors.
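As a sketch of that idea, assume for a moment that the factors really are known. The example below simulates two factors and one series built from them (the loadings, noise level and missing rate are hypothetical), fits a least-squares model on the observed rows only, and fills the gaps from the factors. It illustrates the intuition and is not the library's internal code.

```python
import numpy as np

rng = np.random.default_rng(1)

n_samples, n_factors = 500, 2
factors = rng.normal(size=(n_samples, n_factors))  # pretend these are known
loadings = np.array([1.5, -0.7])                   # hypothetical true loadings
series = factors @ loadings + 0.05 * rng.normal(size=n_samples)

# Hide 60% of the series at random.
missing = rng.random(n_samples) < 0.6
observed = ~missing

# Linear model for this series: least squares on the observed rows only.
coef, *_ = np.linalg.lstsq(factors[observed], series[observed], rcond=None)

# Interpolate the missing rows from the factors.
filled = series.copy()
filled[missing] = factors[missing] @ coef

# Compare against filling with the observed mean.
rmse_factor = np.sqrt(np.mean((filled[missing] - series[missing]) ** 2))
rmse_mean = np.sqrt(np.mean((series[observed].mean() - series[missing]) ** 2))
print(f"RMSE, factor-based fill: {rmse_factor:.3f}")
print(f"RMSE, mean fill:         {rmse_mean:.3f}")
```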

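However, the problem is that you never have these factors to begin with: the old chicken-and-egg problem. But no matter, fixed-point algorithms via Probabilistic PCA to the rescue: start from a rough fill of the missing values, estimate the factors, use them to re-fill the missing values, and repeat until the fill stops changing.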
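The fixed-point idea can be sketched in a few lines of NumPy. Note the swap: this is plain iterative truncated-SVD imputation, a simpler relative of the probabilistic PCA fit that pca-magic performs, and the function name `iterative_pca_impute` and its parameters are hypothetical, chosen only for this example.

```python
import numpy as np

def iterative_pca_impute(data, d, n_iter=200, tol=1e-6):
    """Fill NaNs in data (n_samples x n_features) with a rank-d SVD, iterated."""
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    # Start from the crude guess: column means of the observed values.
    filled = np.where(missing, np.nanmean(data, axis=0), data)
    for _ in range(n_iter):
        mean = filled.mean(axis=0)
        # Estimate factors from the current guess (rank-d truncated SVD).
        u, s, vt = np.linalg.svd(filled - mean, full_matrices=False)
        low_rank = (u[:, :d] * s[:d]) @ vt[:d] + mean
        # Replace only the missing cells with the factor-based prediction.
        updated = np.where(missing, low_rank, data)
        change = np.max(np.abs(updated - filled))
        filled = updated
        if change < tol:
            break
    return filled

# Simulated example: a rank-3 matrix with 30% of its cells hidden.
rng = np.random.default_rng(2)
truth = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 20))
mask = rng.random(truth.shape) < 0.3
data = np.where(mask, np.nan, truth)

filled = iterative_pca_impute(data, d=3)
mean_filled = np.where(mask, np.nanmean(data, axis=0), data)

err_iter = np.sqrt(np.mean((filled[mask] - truth[mask]) ** 2))
err_mean = np.sqrt(np.mean((mean_filled[mask] - truth[mask]) ** 2))
print(f"RMSE on hidden cells, iterative fill: {err_iter:.4f}")
print(f"RMSE on hidden cells, mean fill:      {err_mean:.4f}")
```

Observed cells are never overwritten; only the guesses for the missing cells move from one iteration to the next, which is what makes this a fixed-point scheme.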

With this strategy, over 50 percent of the variance in those 14k+ time series in the CPS can be explained by just 12 factors.

CPS components

Installation

Install via pip:

pip install ppca

Load the data, which should be arranged as n_samples by features. As usual, make sure your data is stationary (take first differences if necessary) and standardized.
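One way to do that preparation with NumPy while leaving missing values as NaN is sketched below. The helper name `difference_and_standardize` is hypothetical, and whether you need to difference at all depends on your data.

```python
import numpy as np

def difference_and_standardize(raw):
    """First-difference each column, then scale it to zero mean and unit variance.

    NaNs are ignored when computing the mean and standard deviation.
    """
    raw = np.asarray(raw, dtype=float)
    diffed = np.diff(raw, axis=0)  # NaN wherever either neighbour is missing
    mean = np.nanmean(diffed, axis=0)
    std = np.nanstd(diffed, axis=0)
    return (diffed - mean) / std

# Simulated example: five random walks (non-stationary) with some gaps.
rng = np.random.default_rng(3)
raw = np.cumsum(rng.normal(size=(400, 5)), axis=0)
raw[rng.random(raw.shape) < 0.1] = np.nan

data = difference_and_standardize(raw)
print(data.shape)
```

Differencing loses one row and turns each missing level into up to two missing differences, so the share of missing cells goes up after this step.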

from ppca import PPCA
ppca = PPCA()

Fit the model, with the parameter d specifying the number of components and verbose=True printing convergence output if required.

ppca.fit(data=data, d=100, verbose=True)

The model parameters and components will be attached to the ppca object.

variance_explained = ppca.var_exp
components = ppca.data
model_params = ppca.C

If you want the principal components, call transform.

component_mat = ppca.transform()

After fitting the model, save it if you want.

ppca.save('mypcamodel')

To load a saved model, first instantiate a PPCA object. This will make fitting/transforming much faster.

ppca.load('mypcamodel.npy')