Adenine: A data exploration pipeline

adenine is a machine learning and data mining Python library for exploratory data analysis.

The main structure of adenine can be summarized in the following four steps.

  1. Imputing: Does your dataset have missing entries? In the first step you can fill in missing values by choosing among several strategies: feature-wise median, mean, or most frequent value, or k-NN imputation.

  2. Preprocessing: Have you ever wondered what would have changed if your data had been preprocessed differently? Or whether preprocessing is a good idea at all? adenine includes several preprocessing procedures, such as data recentering, Min-Max scaling, standardization, and normalization, and it lets you compare the results of analyses run with different preprocessing strategies.

  3. Dimensionality Reduction: In the context of data exploration, this phase is particularly helpful for high-dimensional data. This step includes manifold learning (e.g. Isomap, multidimensional scaling) and unsupervised feature learning (e.g. principal component analysis, kernel PCA, Bernoulli RBM) techniques.

  4. Clustering: This step aims at grouping data into clusters in an unsupervised manner. Several techniques such as k-means, spectral or hierarchical clustering are offered.
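
Since adenine is scikit-learn compliant, the four steps above can be sketched as an ordinary scikit-learn Pipeline. The following is only an illustration of the workflow (using a modern scikit-learn API, not adenine's own), with made-up data:

```python
# One adenine-style pipeline instance sketched with plain scikit-learn:
# imputing -> preprocessing -> dimensionality reduction -> clustering.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(50, 8)
X[::7, 2] = np.nan  # introduce some missing entries

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),           # step 1: imputing
    ('scale', MinMaxScaler(feature_range=(0, 1))),          # step 2: preprocessing
    ('reduce', KernelPCA(n_components=2, kernel='rbf')),    # step 3: dim. reduction
    ('cluster', KMeans(n_clusters=3, n_init=10, random_state=0)),  # step 4: clustering
])
labels = pipe.fit_predict(X)  # one cluster label per sample
```

adenine builds and compares many such pipelines at once, one per combination of selected algorithms.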

The final output of adenine is a compact textual and graphical representation of the results obtained from the pipelines built from every possible combination of the algorithms selected at each step.

adenine can run on multiple cores/machines* and is fully scikit-learn compliant.

Installation

adenine supports Python 2.7.

Pip installation

$ pip install adenine

Installing from sources

$ git clone https://github.com/slipguru/adenine
$ cd adenine
$ python setup.py install

Try Adenine

1. Create your configuration file

Start from the provided template and edit your configuration file with your favourite text editor:

$ ade_run.py -c my-config-file.py
$ vim my-config-file.py
...
from adenine.utils import data_source

# --------------------------  EXPERIMENT INFO ------------------------- #
exp_tag = '_experiment'
output_root_folder = 'results'
plotting_context = 'notebook'  # one of {paper, notebook, talk, poster}
file_format = 'pdf'  # or 'png'

# ----------------------------  INPUT DATA ---------------------------- #
# Load an example dataset or specify your input data in tabular format
X, y, feat_names, index = data_source.load('iris')

# -----------------------  PIPELINES DEFINITION ------------------------ #
# --- Missing Values Imputing --- #
step0 = {'Impute': [True, {'missing_values': 'NaN',
                            'strategy': ['nearest_neighbors']}]}

# --- Data Preprocessing --- #
step1 = {'MinMax': [True, {'feature_range': [(0, 1)]}]}

# --- Unsupervised feature learning --- #
step2 = {'KernelPCA': [True, {'kernel': ['linear', 'rbf', 'poly']}],
         'Isomap': [False, {'n_neighbors': 5}],
         'MDS': [True, {'metric': True}],
         'tSNE': [False],
         'RBM': [True, {'n_components': 256}]
         }

# --- Clustering --- #
# affinity can be precomputed for AP, Spectral and Hierarchical
step3 = {'KMeans': [True, {'n_clusters': [3, 'auto']}],
         'Spectral': [False, {'n_clusters': [3]}],
         'Hierarchical': [False, {'n_clusters': [3],
                                  'affinity': ['euclidean'],
                                  'linkage':  ['ward', 'average']}]
         }
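
Each entry in a step dictionary maps an algorithm name to a list whose first element is an on/off flag and whose optional second element is a parameter dictionary. A hypothetical sketch (not adenine's internal code) of how the enabled entries expand into every pipeline combination:

```python
from itertools import product

# step dictionaries in the same format as the configuration file above:
# name -> [enabled_flag, optional parameter dict]
step2 = {'KernelPCA': [True, {'kernel': ['linear', 'rbf', 'poly']}],
         'Isomap': [False, {'n_neighbors': 5}],
         'MDS': [True, {'metric': True}]}
step3 = {'KMeans': [True, {'n_clusters': [3, 'auto']}],
         'Spectral': [False, {'n_clusters': [3]}]}

def enabled(step):
    """Return (name, params) for each algorithm whose flag is True."""
    return [(name, opts[1] if len(opts) > 1 else {})
            for name, opts in step.items() if opts[0]]

# every combination of one enabled reduction method and one enabled
# clustering method (parameter grids would be expanded the same way)
pipelines = list(product(enabled(step2), enabled(step3)))
```

Here two reduction methods and one clustering method are enabled, so two pipelines result.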

2. Run the pipelines

$ ade_run.py my-config-file.py

3. Automatically generate beautiful publication-ready plots and textual results

$ ade_analysis.py results/ade_experiment_<TODAY>

Need more info?

Check out the project homepage

*Got large-scale data?

adenine takes advantage of mpi4py to distribute the execution of the pipelines on HPC architectures:

$ mpirun -np <MPI-TASKS> --hosts <HOSTS-LIST> ade_run.py my-config-file.py
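
Conceptually, the distribution amounts to partitioning the list of pipeline combinations across the MPI tasks. A minimal round-robin sketch (illustrative only; this is not adenine's actual mpi4py code, and the pipeline labels are made up):

```python
def pipelines_for_rank(pipelines, rank, n_tasks):
    """Return the subset of pipelines assigned to one MPI rank
    under a simple round-robin partition."""
    return [p for i, p in enumerate(pipelines) if i % n_tasks == rank]

# hypothetical pipeline labels, split across 2 MPI tasks
all_pipelines = ['KernelPCA+KMeans', 'MDS+KMeans', 'RBM+KMeans', 'tSNE+KMeans']
work = {rank: pipelines_for_rank(all_pipelines, rank, 2) for rank in range(2)}
# rank 0 runs indices 0 and 2; rank 1 runs indices 1 and 3
```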

Citation

If you use adenine in a scientific publication, we would appreciate citations:

@{coming soon}