All Projects β†’ danielhanchen β†’ Hyperlearn

danielhanchen / Hyperlearn

Licence: bsd-3-clause
50% faster, 50% less RAM Machine Learning. Numba rewritten Sklearn. SVD, NNMF, PCA, LinearReg, RidgeReg, Randomized, Truncated SVD/PCA, CSR Matrices all 50+% faster

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Hyperlearn

Ml Workspace
πŸ›  All-in-one web-based IDE specialized for machine learning and data science.
Stars: ✭ 2,337 (+94.1%)
Mutual labels:  jupyter-notebook, data-science, data-analysis, gpu, scikit-learn
Bayesian Cognitive Modeling In Pymc3
PyMC3 codes of Lee and Wagenmakers' Bayesian Cognitive Modeling - A Pratical Course
Stars: ✭ 93 (-92.28%)
Mutual labels:  jupyter-notebook, data-science, statistics, data-analysis
Datacamp
🍧 A repository that contains courses I have taken on DataCamp
Stars: ✭ 69 (-94.27%)
Mutual labels:  jupyter-notebook, data-science, statistics, data-analysis
Interactive machine learning
IPython widgets, interactive plots, interactive machine learning
Stars: ✭ 140 (-88.37%)
Mutual labels:  jupyter-notebook, data-science, statistics, scikit-learn
Pandas Profiling
Create HTML profiling reports from pandas DataFrame objects
Stars: ✭ 8,329 (+591.78%)
Mutual labels:  jupyter-notebook, data-science, statistics, data-analysis
Virgilio
Virgilio is developed and maintained by these awesome people. You can email us virgilio.datascience (at) gmail.com or join the Discord chat.
Stars: ✭ 13,200 (+996.35%)
Mutual labels:  jupyter-notebook, data-science, statistics, scikit-learn
Dat8
General Assembly's 2015 Data Science course in Washington, DC
Stars: ✭ 1,516 (+25.91%)
Mutual labels:  jupyter-notebook, data-science, data-analysis, scikit-learn
Amazing Feature Engineering
Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered as applied machine learning itself.
Stars: ✭ 218 (-81.89%)
Mutual labels:  jupyter-notebook, data-science, data-analysis, scikit-learn
Covid19 Severity Prediction
Extensive and accessible COVID-19 data + forecasting for counties and hospitals. πŸ“ˆ
Stars: ✭ 170 (-85.88%)
Mutual labels:  jupyter-notebook, data-science, statistics, data-analysis
Machine Learning With Python
Practice and tutorial-style notebooks covering wide variety of machine learning techniques
Stars: ✭ 2,197 (+82.48%)
Mutual labels:  jupyter-notebook, data-science, statistics, scikit-learn
Imodels
Interpretable ML package πŸ” for concise, transparent, and accurate predictive modeling (sklearn-compatible).
Stars: ✭ 194 (-83.89%)
Mutual labels:  jupyter-notebook, data-science, statistics, scikit-learn
Awesome Python Data Science
Probably the best curated list of data science software in Python.
Stars: ✭ 812 (-32.56%)
Mutual labels:  data-science, statistics, data-analysis, scikit-learn
Crime Analysis
Association Rule Mining from Spatial Data for Crime Analysis
Stars: ✭ 20 (-98.34%)
Mutual labels:  jupyter-notebook, data-science, scikit-learn
Data Science On Gcp
Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Stars: ✭ 864 (-28.24%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
My Journey In The Data Science World
πŸ“’ Ready to learn or review your knowledge!
Stars: ✭ 1,175 (-2.41%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
Python for ml
brief introduction to Python for machine learning
Stars: ✭ 29 (-97.59%)
Mutual labels:  jupyter-notebook, data-science, scikit-learn
Mlcourse.ai
Open Machine Learning Course
Stars: ✭ 7,963 (+561.38%)
Mutual labels:  data-science, data-analysis, scikit-learn
Mlj.jl
A Julia machine learning framework
Stars: ✭ 982 (-18.44%)
Mutual labels:  jupyter-notebook, data-science, statistics
Optimus
🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Stars: ✭ 986 (-18.11%)
Mutual labels:  jupyter-notebook, data-science, data-analysis
Socrat
A Dynamic Web Toolbox for Interactive Data Processing, Analysis, and Visualization
Stars: ✭ 26 (-97.84%)
Mutual labels:  data-science, statistics, data-analysis

[Due to the time taken @ uni, work + hell breaking loose in my life, since things have calmed down a bit, will continue commiting!!!] [By the way, I'm still looking for new contributors! Please help make HyperLearn no1!!]

drawing

HyperLearn is what drives Umbra's AI engines. It is open source to everyone, everywhere, and we hope humanity can rise to the stars.

[Notice - I will be updating the package monthly or bi-weekly due to other commitments]


drawing https://hyperlearn.readthedocs.io/en/latest/index.html

Faster, Leaner GPU Sklearn, Statsmodels written in PyTorch

GitHub issues Github All Releases

50%+ Faster, 50%+ less RAM usage, GPU support re-written Sklearn, Statsmodels combo with new novel algorithms.

HyperLearn is written completely in PyTorch, NoGil Numba, Numpy, Pandas, Scipy & LAPACK, and mirrors (mostly) Scikit Learn. HyperLearn also has statistical inference measures embedded, and can be called just like Scikit Learn's syntax (model.confidence_interval_) Ongoing documentation: https://hyperlearn.readthedocs.io/en/latest/index.html

I'm also writing a mini book! A sneak peak: drawing

drawing

Comparison of Speed / Memory

Algorithm n p Time(s) RAM(mb) Notes
Sklearn Hyperlearn Sklearn Hyperlearn
QDA (Quad Dis A) 1000000 100 54.2 22.25 2,700 1,200 Now parallelized
LinearRegression 1000000 100 5.81 0.381 700 10 Guaranteed stable & fast

Time(s) is Fit + Predict. RAM(mb) = max( RAM(Fit), RAM(Predict) )

I've also added some preliminary results for N = 5000, P = 6000 drawing

Since timings are not good, I have submitted 2 bug reports to Scipy + PyTorch:

  1. EIGH very very slow --> suggesting an easy fix #9212 https://github.com/scipy/scipy/issues/9212
  2. SVD very very slow and GELS gives nans, -inf #11174 https://github.com/pytorch/pytorch/issues/11174

Help is really needed! Message me!


Key Methodologies and Aims

1. Embarrassingly Parallel For Loops

2. 50%+ Faster, 50%+ Leaner

3. Why is Statsmodels sometimes unbearably slow?

4. Deep Learning Drop In Modules with PyTorch

5. 20%+ Less Code, Cleaner Clearer Code

6. Accessing Old and Exciting New Algorithms


1. Embarrassingly Parallel For Loops

  • Including Memory Sharing, Memory Management
  • CUDA Parallelism through PyTorch & Numba

2. 50%+ Faster, 50%+ Leaner

3. Why is Statsmodels sometimes unbearably slow?

  • Confidence, Prediction Intervals, Hypothesis Tests & Goodness of Fit tests for linear models are optimized.
  • Using Einstein Notation & Hadamard Products where possible.
  • Computing only what is necessary to compute (Diagonal of matrix and not entire matrix).
  • Fixing the flaws of Statsmodels on notation, speed, memory issues and storage of variables.

4. Deep Learning Drop In Modules with PyTorch

  • Using PyTorch to create Scikit-Learn like drop in replacements.

5. 20%+ Less Code, Cleaner Clearer Code

  • Using Decorators & Functions where possible.
  • Intuitive Middle Level Function names like (isTensor, isIterable).
  • Handles Parallelism easily through hyperlearn.multiprocessing

6. Accessing Old and Exciting New Algorithms

  • Matrix Completion algorithms - Non Negative Least Squares, NNMF
  • Batch Similarity Latent Dirichelt Allocation (BS-LDA)
  • Correlation Regression
  • Feasible Generalized Least Squares FGLS
  • Outlier Tolerant Regression
  • Multidimensional Spline Regression
  • Generalized MICE (any model drop in replacement)
  • Using Uber's Pyro for Bayesian Deep Learning

Goals & Development Schedule

Will Focus on & why:

1. Singular Value Decomposition & QR Decomposition

* SVD/QR is the backbone for many algorithms including:
    * Linear & Ridge Regression (Regression)
    * Statistical Inference for Regression methods (Inference)
    * Principal Component Analysis (Dimensionality Reduction)
    * Linear & Quadratic Discriminant Analysis (Classification & Dimensionality Reduction)
    * Pseudoinverse, Truncated SVD (Linear Algebra)
    * Latent Semantic Indexing LSI (NLP)
    * (new methods) Correlation Regression, FGLS, Outlier Tolerant Regression, Generalized MICE, Splines (Regression)

On Licensing: HyperLearn is under a GNU v3 License. This means:

  1. Commercial use is restricted. Only software with 0 cost can be released. Ie: no closed source versions are allowed.
  2. Using HyperLearn must entail all of the code being avaliable to everyone who uses your public software.
  3. HyperLearn is intended for academic, research and personal purposes. Any explicit commercialisation of the algorithms and anything inside HyperLearn is strictly prohibited.

HyperLearn promotes a free and just world. Hence, it is free to everyone, except for those who wish to commercialise on top of HyperLearn. Ongoing documentation: https://hyperlearn.readthedocs.io/en/latest/index.html [As of 2020, HyperLearn's license has been changed to BSD 3]

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].