Shen-Lab / gcWGAN

License: GPL-3.0
Guided Conditional Wasserstein GAN for De Novo Protein Design


DeNovoFoldDesign

Motivation: Facing quickly accumulating data on protein sequence and structure, this study addresses the following question: to what extent can current data alone reveal deep insights into the sequence-structure relationship, such that new sequences can be designed accordingly for novel structural folds?

Results: We have developed novel deep generative models, constructed a low-dimensional and generalizable representation of fold space, exploited sequence data with and without paired structures, and developed an ultra-fast fold predictor as an oracle providing feedback. The resulting semi-supervised gcWGAN is assessed with the oracle over 100 novel folds not in the training set and found to generate more yields and cover 3.6 times more target folds compared to a competing data-driven method (cVAE). Assessed with a structure predictor over representative novel folds (including one not even part of the basis folds), gcWGAN designs are found to have comparable or better fold accuracy yet much more sequence diversity and novelty than cVAE. gcWGAN explores uncharted sequence space to design proteins by learning from current sequence-structure data. The ultra-fast data-driven model can be a powerful addition to principle-driven design methods through generating seed designs or tailoring sequence space.

Training-Process

(Figure: overview of the gcWGAN training process; image not included in this copy.)

Pre-requisite

* Anaconda 2 (the Python 2 distribution of Anaconda).

* Environments:

To build the environments for this project, go to the Environments folder, then run

conda env create -f tensorflow_training.yml
conda env create -f DeepDesign_acc.yml

For the oracle (modified DeepSF), add the following function to the file <path where keras was installed>/keras/activations.py (here K is the Keras backend module already imported at the top of that file):

def leakyrelu(x, alpha=0.1, max_value=None):
    # Leaky ReLU: a ReLU with non-zero slope alpha for negative inputs
    return K.relu(x, alpha=alpha, max_value=max_value)

* Backend of Keras:

In this project we use two Keras backends, Theano and TensorFlow; the active backend is set in the file ~/.keras/keras.json.

  • When training gcWGAN, set the backend to be tensorflow as follows:
{
    "epsilon": 1e-07,
    "floatx": "float32",
    "image_dim_ordering":"tf",
    "image_data_format": "channels_last",
    "backend": "tensorflow"
}
  • Otherwise (training cWGAN, pretraining and generating sequences), set the backend to be theano as follows:
{
    "epsilon": 1e-07,
    "floatx": "float32",
    "image_dim_ordering":"tf",
    "image_data_format": "channels_last",
    "backend": "theano"
}
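Since the backend has to be switched between steps, it is easy to launch a run with the wrong setting. Below is a small helper for checking which backend the config file currently selects; it is a hypothetical convenience script, not part of the repo, and `read_backend` is a name we introduce here:

```python
# Hypothetical helper (not part of the repo): read ~/.keras/keras.json and
# report which backend Keras will load, so you can confirm the setting
# before launching a training run.
import json
import os

def read_backend(path=os.path.expanduser("~/.keras/keras.json")):
    """Return the backend name recorded in keras.json ('tensorflow' if unset)."""
    try:
        with open(path) as f:
            return json.load(f).get("backend", "tensorflow")
    except FileNotFoundError:
        # Keras falls back to its default backend when no config file exists.
        return "tensorflow"
```

Running `read_backend()` before a gcWGAN training job should print `tensorflow`; before cWGAN training, pretraining, or sequence generation it should print `theano`.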

* Checkpoints:

  • To train the model (for cWGAN and gcWGAN): go directly to the cWGAN or gcWGAN folder and follow the instructions.
  • To apply our model for evaluation or sequence generation (for Model_Test and Model_Evaluation): go to the Checkpoints folder and download the related checkpoints into the correct path according to the instructions.

Our checkpoints were obtained after 100 epochs of training. If you have already downloaded our checkpoints but want to retrain the model with the same hyper-parameters, note that the downloaded checkpoints will be overwritten once the training process reaches the 100th epoch.


Table of contents:

  • Environments: Contains the *.yml files with which you can build the required environments.
  • Data: Contains the original data, processed data, and related processing scripts.
  • Oracle: Contains the scripts for computing sequence features and applying the oracles.
  • cWGAN: Contains the scripts for cWGAN model training and validation (hyper-parameter tuning).
  • gcWGAN: Contains the scripts for gcWGAN model training.
  • Model_Evaluation: Contains the scripts for model performance evaluation.
  • Model_Apply: Contains the scripts to apply the trained model.
  • Generated_Results: Contains the sequence samples generated by our model for the evaluation part (except the yield-ratio part, which can be too large to upload) and the selected structure predictions from Rosetta based on gcWGAN.

Model Application

In this part you can apply our models to generate protein sequences according to a given protein fold (*.pdb file). With the scripts you can represent the given fold with a 20-dimensional vector and send it to the generator for sequence generation. Go to the Model_Apply folder for more details.
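The last step of that pipeline, turning generator output into an amino-acid string, can be pictured as follows. This is an illustrative sketch only: the fold embedding, the per-position score matrix, and the `decode` helper are hypothetical stand-ins for what the actual Model_Apply scripts produce.

```python
# Illustrative sketch only: the fold embedding and generator output below are
# hypothetical stand-ins, not the repo's actual data structures.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def argmax(row):
    """Index of the largest score in one position's score vector."""
    return max(range(len(row)), key=lambda i: row[i])

def decode(position_scores):
    """Turn a (length x 20) score matrix into an amino-acid sequence."""
    return "".join(AMINO_ACIDS[argmax(row)] for row in position_scores)

fold_embedding = [0.0] * 20  # stand-in for the 20-dimensional fold representation
# Stand-in generator output for a 3-residue design: each row scores 20 residues.
scores = [
    [1.0 if i == 0 else 0.0 for i in range(20)],   # highest score at 'A'
    [1.0 if i == 5 else 0.0 for i in range(20)],   # highest score at 'G'
    [1.0 if i == 19 else 0.0 for i in range(20)],  # highest score at 'Y'
]
print(decode(scores))  # prints "AGY"
```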

Some examples of the generated sequences (10 sequences based on gcWGAN that pass the oracle):

>1
MIAPDQTIEKYVKFMAPVFTTTEYLKIVEMEEKGITTIAHGPVIHTARNPYAEVRLVSVTHELLIELQASGFLNISKTICLFETGIDENKEVLIDKDDYKEEPLLVDLFLEMEGPMDGQEIMTKLVRVPVMGQSLKPYAVKKAGVIKSAKHVG
>2
PCYALTVEAVENLLQAPAVRTLQKDEGLTPRLQPGIAAYASFIAGGAGCGLTRGSSDNMAKALIQEIEKTLRAVELTPATVQILVNNNEVKLPEKEKPNAIAKGILTVNLISKMDEFTKLVLVGENYTAILIDHIAKHKVGPV
>3
MCYDIAQSYLNFMMINGTVLIQTATRTLCPAVHSACRYDYIKVTAAKGNIVTDIGLMYFVRNMELVGPLMTATVAISKSIYTVQKATKETVNEMRTLQVAGTRTMFCRIYHVDMTKMMMQTGISIVGEKKPTRHDAEITYDQLAGHLVPLAHLKKL
>4
CTKAQRGVHKIYEVEKNYMPNRTLGDPNSLRIDSIGIRPVNERKDNTRYVAKKAKAILAKKDIMYCLPINIDVVKVTSTLDNYLDGDPYSKRPRFDDNLIKAVIPTDVALKPSPRYDVQAGRETPPAYTAVVQRFFSVKLNRL
>5
CPNVYQKLLYSMTEGPMDIGPVEVGQLLAVIPSAIGKVVSEITTSVHPAAPFEEAARVTAMAQRAALQYSTQTYLVGKESIALMYGKYRALHQDLARMVLADGQTADVQEVVPIIADIQRMHPAGQVAPRLIESGVVTASVLMTAA
>6
LLHGKLEVFHKCVAKADEASGLTFFHCGCSAYVTSEAAKGRYRPRACSTVHYFEKGATIPGLQYTNMYENAMVCTSKIRIYLEAMNMAPNVPLHRAAKYDNVSAALTANNNKVALIAEYYVTALLEGEVTQHLEEYKKNPPPELYEEIC
>7
MNKINIKYCPFNFNKVFRKEAFITQMAGENMAVLKELSEQIDHCSCFHKNTARQLLHRAEDGPVTEVETLLELRAAMICCFRRRAPRLVLGSSMSTTVITKCIAICTGQPYPGNGPPTTLGQPACSGVEVINNQAAIVIQTVEQRFILMTPGK
>8
CTVTAVQEFTENYGGLPLYVTRNQTLAPADKRLTPRYAGNFPEGAEVPAPNLAQTSPGVTYGKNIGRYLKNGLPDVAICTSPNLNLSGAYPDIVKYNYQQPEVFIRQYHPGNEMDVVKALEQFSSELLPGKTMSIVVNSYNNLADK
>9
CETTIDIEASVISQVIAVIVALTPIHKYAHASSKALASGASDVNVGPKLVAYIGKIAYSDPPIDLIPPVKVVVALLAPELAGVTAADYISYNEGKPATGESAGNAAFADGTTTIAPQRTIYEGEHKARINIITIADGAPLGSHEIP
>10
PEPDLVLTCTNLSFSAMVSCLRETSAFAGVEYAYNGIHPAGSCCLAAMKKGFFPHTEGMNALVIEPTPPVPCAPTKDLVQNKIQKAKLLPPAATTADEYSETLGQEDFLKLLTNPKITEKKKSPTTLILVTVNSELMISPVYFTGPLMKELLYHCNGEN
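A quick sanity check one might run on such output (not part of the repo's scripts): every final design should use only the 20 standard amino-acid letters.

```python
# Sanity check (illustrative, not from the repo): generated designs should
# contain only the 20 standard amino-acid letters.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def is_valid_protein(seq):
    """True when the sequence is non-empty and uses only standard residues."""
    return len(seq) > 0 and set(seq) <= STANDARD_AA

# Prefixes of generated sequences >1 and >4 above pass the check.
assert is_valid_protein("MIAPDQTIEKYVKFMAPVFT")
assert is_valid_protein("CTKAQRGVHKIYEVEKNYMP")
assert not is_valid_protein("MIAPDQ!!!")  # padding tokens are not residues
```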

Training Process:

In the cWGAN and gcWGAN folders there are scripts for training our two models. For cWGAN there are also scripts for validation (hyper-parameter tuning), and for gcWGAN there are also scripts for the Warmstart. Go to the cWGAN or gcWGAN folder for more details.

Some examples of the generated sequences during the training process:

fold a.39: vvaitfdnvhfpcshapltkaltvkklqvsannvsllvfddakmtkkidiekaikgfymmknnpqaqleiierftpttrgkpvikpiasftltspeilgkegykk!!!!!!!!!!!!!!!!!!!itkmlidavks!!!!!!!!!!!!!!!!!!!!!!!!!
fold d.78: leemskvgntpaltyreardvavigifnngkqmksrddvtdeaddyqceidpisnllelgallpplhvaetkmllyykneakmhlfegag!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
fold d.240: tlaedippklpleveqcneiivdaqnkryvavgealllitcpmlqnnsmsttcgyrfeakskdgvicespeeglqndtthyachkraaavqiptekkttvyrlhacttklegcaeadnrvladvgldgivqravcdivttfsaevnp!!!!!!!!!!!!!
fold d.227: sckpglplvcagkkstyleklltgylvyslladyispkaleeavisekkpniampafatmpslvaddvtaliakkglqnaakcpndhmeiyeaeedpaiigqgynkhqgvgcnivvmagaipdeqkvenlrsliei!!!!!!!!!!!!!!!!!!!!!!!!
fold d.301: mtakstvqlpaeykgqniaeilnnvafnlaaivysattivayramacfpcgeknykeilgkvltlfidkhpiqnnr!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
fold d.223: mqtyeeavtlgltneqqtgknvtpiniaeekllvtnglvcqapalpvneevliklsentdnikpllciigkkseaispcsfraeeafdrsadymankatimcrkgnyaiilhsdgeellaihqtsgviirlghvpgkknrymppgaliplcngp!!!!!!
fold a.216: eelakrmiqrapdveligknkiatelkrlcllirgqtaanimnvillcataisvipkkskpasqyeetvnpadlakeiilqekkeaftriltteylvtsllkmypvhkvpkp!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
fold b.60: qpifvtykrlnrlallkshplhkdpkyltavlvmeldpsslpvavqpqrvvtiqsccpiiepsappeecdiqapnklkallendkptsqn!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
fold c.9: ayfereelilltpthggnepktldiptnlpakilgtplrkvklasqkellgeahpnnavstlideayylgdeqrevvvlteqekkagpidithyvngtegsckkpnisdsptphakafkqilkemqariqhhkelittalerlkn!!!!!!!!!!!!!!!
fold a.180: mpeecvctglepgevrrqngvipllnqgfhavltpagktylccttatknqvivhmfcqtaaeniyaeitvsylrtaatstylefmkhccqnvssihygiymslmdllkeyvveklv!e!!!!!!!!!!!!!!!!!!!iaeqipearkyaaalvg!!!!!!
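The runs of `!` in the samples above appear to be padding tokens filling positions beyond the designed length (our reading of the samples; the repo's scripts may handle this differently). A minimal clean-up step would be:

```python
# Strip padding tokens from raw generator output. The interpretation of '!'
# as padding is an assumption based on the training samples shown above.
def strip_padding(raw, pad="!"):
    """Drop padding tokens from a raw generator output string."""
    return raw.replace(pad, "")

raw = "mtakstvqlpaeykgqniaeilnnvafnl!!!!!"
print(strip_padding(raw))  # prints "mtakstvqlpaeykgqniaeilnnvafnl"
```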

Evaluate Model Performance:

This part contains the scripts we applied to evaluate the performance of our model. We also generated several sequences with the previous state-of-the-art model cVAE and applied our evaluation method for comparison. Model evaluation consists of three parts: model accuracy, sequence generation rate, and sequence diversity and novelty. For model accuracy we applied yield-ratio calculation to all the training, validation, and test folds. Go to the Model_Evaluation folder for more details.
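As a sketch of the accuracy metric: assuming (hypothetically) that the oracle assigns each generated sequence a predicted fold label, the yield ratio for a target fold is the fraction of generated sequences the oracle maps to that fold. The function name and data shapes below are our own, not the repo's.

```python
# Sketch of a yield-ratio calculation; assumes the oracle returns one
# predicted fold label per generated sequence (a hypothetical interface).
def yield_ratio(predicted_folds, target_fold):
    """Fraction of generated sequences the oracle assigns to the target fold."""
    if not predicted_folds:
        return 0.0
    hits = sum(1 for f in predicted_folds if f == target_fold)
    return hits / float(len(predicted_folds))

# Example: 3 of 4 designs predicted to adopt fold 'a.39'.
print(yield_ratio(["a.39", "a.39", "d.78", "a.39"], "a.39"))  # prints 0.75
```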


Citation:

@article{gcWGAN,
author = {Karimi, Mostafa and Zhu, Shaowen and Cao, Yue and Shen, Yang},
title = {De Novo Protein Design for Novel Folds Using Guided Conditional Wasserstein Generative Adversarial Networks},
journal = {Journal of Chemical Information and Modeling},
volume = {60},
number = {12},
pages = {5667-5681},
year = {2020},
doi = {10.1021/acs.jcim.0c00593},
note ={PMID: 32945673},
URL = {https://doi.org/10.1021/acs.jcim.0c00593},
eprint = {https://doi.org/10.1021/acs.jcim.0c00593}
}

Contacts:

Yang Shen: [email protected]

Mostafa Karimi: [email protected]

Shaowen Zhu: [email protected]

Yue Cao: [email protected]
