minhoolee / Synopsys-Project-2017

License: BSD-3-Clause
A deep-learning-based bioinformatics project on epigenetics in Type 2 Diabetes.

Programming Languages

- Jupyter Notebook
- Python
- Makefile

Projects that are alternatives of or similar to Synopsys-Project-2017

- cellrank: CellRank for directed single-cell fate mapping. Stars: 222. Label: genetics
- dee2: Digital Expression Explorer 2 (DEE2), a repository of uniformly processed RNA-seq data. Stars: 32. Label: genetics
- cutevariant: A standalone, free application to explore genetic variations from VCF files. Stars: 61. Label: genetics
- LDServer: Fast API server for calculating linkage disequilibrium. Stars: 13. Label: genetics
- GeneLab Data Processing: No description or website provided. Stars: 32. Label: genetics
- HumanIdiogramLibrary: Resource of human chromosome schematics & images. Stars: 76. Label: genetics
- GlucoseTray: Tray icon for displaying current BG information in the taskbar. Stars: 18. Label: diabetes
- fastbaps: A fast approximation to a Dirichlet Process Mixture model (DPM) for clustering genetic data. Stars: 24. Label: genetics
- graphsim: R package to simulate expression data from igraph networks using mvtnorm (CRAN; JOSS). Stars: 16. Label: genetics
- xpclr: Code to compute the XP-CLR statistic to infer natural selection. Stars: 64. Label: genetics
- manhattan generator: Manhattan plot generator. Stars: 20. Label: genetics
- fwdpy11: Forward-time simulation in Python using fwdpp. Stars: 25. Label: genetics
- pyro-cov: Pyro models of SARS-CoV-2 variants. Stars: 39. Label: genetics
- genipe: Genome-wide imputation pipeline. Stars: 28. Label: genetics
- awesome-genetics: A curated list of awesome bioinformatics software. Stars: 60. Label: genetics
- rvtests: Rare variant test software for next-generation sequencing data. Stars: 114. Label: genetics
- region-plot: A tool to plot significant regions of GWAS. Stars: 20. Label: genetics
- Diaguard: Android app for diabetics. Stars: 89. Label: diabetes
- impute-me: The code behind the www.impute.me site; contains algorithms for personal genome analysis, including imputation and polygenic risk score calculation. Stars: 96. Label: genetics
- Repo-Bio: Binomica public repository for biological parts. Stars: 21. Label: genetics

Synopsys Project 2016-2017

Leveraging Deep Learning to Derive De Novo Epigenetic Mechanisms from PPARGC1A to Account for Missing Heritability in Type II Diabetes Mellitus

Data / Statistical Analysis Jupyter Notebooks

Until GitHub fixes notebook rendering, please view the notebooks (the same as those in notebooks/) at the following links:

- https://nbviewer.jupyter.org/github/minhoolee/Synopsys-Project-2017/blob/master/notebooks/0.1-mhl-data-analysis.ipynb
- https://nbviewer.jupyter.org/github/minhoolee/Synopsys-Project-2017/blob/master/notebooks/0.1-mhl-model-predictions.ipynb
- https://nbviewer.jupyter.org/github/minhoolee/Synopsys-Project-2017/blob/master/notebooks/0.2-mhl-model-predictions.ipynb

Synopsys Competition Tri-Fold

The focus of the project was using deep learning to predict novel epigenetic mechanisms, such as DNase I hypersensitive sites, histone modifications, and transcription factor binding sites, from raw genomic sequences. Type II diabetes (T2D) is a common disease that affects millions of people each year, yet only around 10% of its heritability has been explained to date. Researchers speculate that this is because epigenetics is heavily involved, so my project was designed to interpret millions of samples and hundreds of epigenetic regulators in order to understand the combinatorial effects of these epigenetic mechanisms.

I conducted this independent research project for the Synopsys science fair as a high school junior. To train my models, I built my own custom PC (see specs here). I would like to thank my mentor, Renee Fallon, for providing me with biology textbooks and general advice.

Custom Built PC

Steps for reproducing results

Step 1. Get data

Download the processed data from DeepSEA and move it to data/processed/. The data is processed in the following manner:

Data on histone modifications, DNase I hypersensitive sites, and transcription factor binding sites is collected via ChIP-seq and DNase-seq. It comprises 919 ChIP-seq and DNase-seq peaks from processed ENCODE and Roadmap Epigenomics data releases for GRCh37; the data is publicly available to download and has been processed by the researchers of the DeepSEA framework (Zhou). The input is encoded as a 1000 x 4 binary matrix, with the columns corresponding to A, T, G, and C and the rows corresponding to the base pairs (1 kbp) in a single bin, which serves as the input for a single neuron. Each 1000 bp region is centered on a 200 bp sequence that contains at least one transcription factor binding site (400 bp of padding on each side for genomic sequence context). The data is split into training, validation, and test sets, which are separated by chromosome so that the model can be tested for high bias on chromosomes it has never seen. A sketch of the one-hot encoding is shown below.
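As an illustration of the encoding described above, here is a minimal sketch of one-hot encoding a DNA sequence into a 1000 x 4 binary matrix. The function name and the use of NumPy are illustrative assumptions, not the project's actual preprocessing code.

```python
import numpy as np

# Column order follows the description above: A, T, G, C.
BASE_TO_COLUMN = {'A': 0, 'T': 1, 'G': 2, 'C': 3}

def one_hot_encode(sequence, length=1000):
    """Encode a DNA string as a (length x 4) binary matrix.

    Each row is one position in the sequence and holds a single 1 in
    the column of its base; unknown bases (e.g. 'N') stay all zeros.
    """
    matrix = np.zeros((length, 4), dtype=np.uint8)
    for row, base in enumerate(sequence[:length].upper()):
        column = BASE_TO_COLUMN.get(base)
        if column is not None:
            matrix[row, column] = 1
    return matrix

# One 1000 bp bin: a 200 bp core plus 400 bp of context on each side.
example = one_hot_encode('ATGC' * 250)
print(example.shape)  # (1000, 4)
```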

Step 2. Create model

Create a method in src/models/create_models.py that constructs a Keras model (sequential, functional, etc.) and then returns it.
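As a reference point, here is a minimal sketch of what such a method might look like. The function name and the architecture (a small 1D convolutional network over the 1000 x 4 one-hot input, ending in 919 sigmoid units to match the chromatin features from step 1) are assumptions for illustration, not the model actually used in this project.

```python
# src/models/create_models.py (hypothetical example method)
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout

def create_example_conv_model():
    """Build and return a small convolutional Keras model.

    Input:  a (1000, 4) one-hot encoded DNA sequence.
    Output: 919 sigmoid units, one per chromatin feature (multi-label).
    """
    model = Sequential([
        Conv1D(320, 8, activation='relu', input_shape=(1000, 4)),
        MaxPooling1D(4),
        Dropout(0.2),
        Conv1D(480, 8, activation='relu'),
        MaxPooling1D(4),
        Dropout(0.2),
        Flatten(),
        Dense(925, activation='relu'),
        Dense(919, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model
```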

Step 3. Train model

Run `make train MODEL_FUNC='<method from step 2>' MODEL_NAME='<some unique identifier>'`

Step 4. Test model and generate predictions

Run `make test MODEL_FUNC='<same as in step 3>' MODEL_NAME='<same as in step 3>'`
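For example, using the hypothetical method name from the sketch in step 2, the two commands together would be `make train MODEL_FUNC='create_example_conv_model' MODEL_NAME='conv-v1'` followed by `make test MODEL_FUNC='create_example_conv_model' MODEL_NAME='conv-v1'`. Both identifiers are placeholders; substitute your own method name and model name.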

Step 5. Generate performance scores (ROC/PR, stdev, etc.) and visualizations

See notebooks/ and run the code after the "Execute the following" headers. Make sure to run them with the Theano backend for Keras, because the models were all trained with Theano.
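One way to force the Theano backend is to set the KERAS_BACKEND environment variable before Keras is first imported, e.g. in the notebook's first cell (editing the "backend" field in ~/.keras/keras.json works as well):

```python
# Run this before any cell that imports Keras.
import os
os.environ['KERAS_BACKEND'] = 'theano'

import keras  # should print "Using Theano backend."
```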

Project Organization

├── LICENSE
├── Makefile                  <- Makefile with commands like `make data` or `make train`
├── README.md                 <- The top-level README for developers using this project.
├── data
│   ├── external              <- Data from third party sources.
│   ├── interim               <- Intermediate data that has been transformed.
│   ├── processed             <- The final, canonical data sets for modeling.
│   └── raw                   <- The original, immutable data dump.
│
├── docs                      <- A default Sphinx project; see sphinx-doc.org for details
│
├── models                    <- Trained and serialized models, model predictions, or model summaries
│   ├── csv                   <- CSV logs of epoch and batch runs
│   ├── json                  <- JSON representation of the models
│   ├── predictions           <- Predictions generated by the trained models and their best weights
│   ├── weights               <- Best weights for the models
│   └── yaml                  <- YAML representation of the models
│
├── notebooks                 <- Jupyter notebooks. Naming convention is a number (for ordering),
│                                the creator's initials, and a short `-` delimited description, e.g.
│                                `1.0-jqp-initial-data-exploration`.
│
├── references                <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports                   <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures               <- Generated graphics and figures to be used in reporting
│
├── requirements.txt          <- The requirements file for reproducing the analysis environment, e.g.
│                                generated with `pip freeze > requirements.txt`
│
├── src                       <- Source code for use in this project.
│   ├── __init__.py           <- Makes src a Python module
│   │
│   ├── data                  <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features              <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── logging               <- Scripts to improve python logging
│   │   └── log_utils.py
│   │
│   ├── models                <- Scripts to train and test models and then use trained models to make
│   │   │                         predictions
│   │   ├── create_models.py  <- Script to create a Keras model and return it to train_model.py
│   │   ├── predict_model.py
│   │   ├── test_model.py
│   │   └── train_model.py
│   │
│   ├── unit_tests            <- Scripts to test each unit of the other scripts
│   │
│   └── visualization         <- Scripts to create exploratory and results oriented visualizations
│       ├── plot_train_valid.py
│       ├── stats.py
│       └── visualize.py
│
└── tox.ini                   <- tox file with settings for running tox; see tox.testrun.org

Project based on the cookiecutter data science project template. #cookiecutterdatascience
