Self-Taught Data Science Playground
This repository is a collection of my self-taught notebooks on data science theory and practice. Considerable effort goes into striking a balance between methodology derivation (with math) and hands-on coding. The target audience is data science practitioners (including myself) with hands-on experience who are seeking a more in-depth understanding of machine learning algorithms and the relevant statistics.
Visit the website Hello, Data Science!, which hosts all the notebooks in nicely rendered HTML.
Notebooks Summary
notebooks/
Each notebook is written in either Jupyter or R Markdown.
Most notebooks use Python, R, or both.
You may sometimes find me interoperating the two languages in a single notebook.
This is achieved thanks to reticulate.
- Statistics
- Machine Learning
- Natural Language Understanding
  - On Subword Units
  - [Context-Free Word Embeddings]
  - [Context-Aware Word Embeddings]
- Data Engineering
  - Infrastructure-as-Code: A Terraform AWS Use Case
  - Serverless Deployment: AWS Lambda with HTTP API
- Programming
- Projects
Laboratory Scripts
labs/
These are quick-and-dirty scripts for exploring a variety of open source machine learning tools. They may be incomplete and messy to read.
[Optional] Setup Python Environment
To ensure reproducibility, it is recommended to use pyenv along with pyenv-virtualenv to control both the Python version and the package versions.
pyenv supports only Linux and macOS.
For Windows users, it is recommended to use conda instead.
Install Different Python Version
To use a virtualenv with reticulate in Rmd, the Python involved must be built with its shared library enabled:
PYTHON_CONFIGURE_OPTS="--enable-shared" pyenv install 3.7.0
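A quick way to confirm that the build picked up the shared-library option is to ask the interpreter for its build configuration. The sketch below is only an illustration: `check_shared` is a hypothetical helper name, not something shipped in this repository, and it assumes a Unix build of CPython, where the `Py_ENABLE_SHARED` config variable is reported as 1 or 0.

```shell
# Hypothetical helper (not part of this repository).
# Prints 1 if the interpreter was configured with --enable-shared, 0 otherwise.
# Usage: check_shared [interpreter]   (defaults to the `python` found on PATH)
check_shared() {
  "${1:-python}" -c 'import sysconfig; print(sysconfig.get_config_var("Py_ENABLE_SHARED"))'
}
```

Running `check_shared` with the pyenv-installed version active should print 1 if the install command above was used. A 0 suggests the interpreter was built without the option and may need to be reinstalled before reticulate can use it.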
Create virtualenv
Each notebook has different package dependencies. Here is an example that creates an environment specific to the notebook on model explainability:
cd notebooks/ml/model_explain
pyenv virtualenv 3.7.0 k9-model-explain
pyenv local k9-model-explain
pip install --upgrade pip
pip install -r requirements.txt
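Since each notebook carries its own dependencies, it can help to see which notebook directories define them before creating environments. The following is a minimal sketch, assuming that every such directory holds a `requirements.txt` as in the model explainability example above. `list_notebook_envs` is a hypothetical helper name, not part of this repository.

```shell
# Hypothetical helper (not part of this repository).
# Lists every directory under a root (default: notebooks) that contains
# its own requirements.txt, one directory per line, sorted.
list_notebook_envs() {
  find "${1:-notebooks}" -name requirements.txt -exec dirname {} \; | sort
}
```

Calling `list_notebook_envs` from the repository root prints the candidate directories. The virtualenv steps above can then be repeated in each one with an environment name of your choice.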
TODO
Topics
- Machine Learning
  - Factorization Machines
  - Recurrent Neural Nets
  - Sequence-to-Sequence Models
  - GANs
  - Reinforcement Learning Basics
  - Approximate Nearest Neighbors
- Statistics
  - Law of Large Numbers and Central Limit Theorem
  - On Linear Regression: Machine Learning vs Econometrics
  - Linear Mixed Effects Models
  - Naive Bayes
  - Bayesian Model Diagnostics
  - Bayesian Time Series Forecasting
- Tools/Programming
  - PyTorch Hands-On
  - RASA Chatbot Framework Hands-On
- Programming
  - R
    - Production Quality Shiny App Development
  - Python
    - Dash for Interactive Dashboarding
  - R
- Projects
  - Model Deployment with gRPC
Site
- Dockerize each notebook (for complete reproducibility and portability)?
- Tidy up dependencies for each notebook