songlab-cal / tape-neurips2019

Licence: MIT license
Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. (DEPRECATED)

Tasks Assessing Protein Embeddings (TAPE)

Note: This is the code associated with our original paper and benchmark. For our newer PyTorch code, see https://github.com/songlab-cal/tape.

This repository contains data, weights, and code for running the TAPE benchmark on a trained protein embedding. We provide a pretraining corpus, five supervised downstream tasks, pretrained language model weights, and benchmarking code.

Paper

Preprint is available at https://arxiv.org/abs/1906.08230.

Data

Data should be placed in the ./data folder, although you may specify a different data directory.

The supervised data is around 120 MB compressed and 900 MB uncompressed. The unsupervised Pfam dataset is around 5 GB compressed and 40 GB uncompressed. The training data is hosted on AWS. By default we provide data as TFRecords; see ./tape/data_utils/ for the deserializers for each dataset and documentation of the data keys. To download all of TAPE, run download_data.sh. We also provide links to each individual dataset below in both TFRecord and JSON format.

TFRecord Data

Pretraining Corpus (Pfam) | Secondary Structure | Contact (ProteinNet) | Remote Homology | Fluorescence | Stability

Raw Data

Raw data files are stored in JSON format for maximum portability. They are larger than the serialized TFRecord files (about 3x larger on average). For all tasks except proteinnet, we directly provide the output of our TFRecord parsing function on the file. For the proteinnet task we do not provide contact maps directly, as these massively increase the file size; instead we provide the 3D positions of all alpha-carbon atoms. This is also what the TFRecord files store: our parsing function constructs the contact maps from this information on the fly.
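
As an illustration of how a contact map can be derived from alpha-carbon coordinates, the sketch below thresholds pairwise distances. The function name contact_map, the 8 angstrom cutoff, and the toy coordinates are assumptions made for this example; it is not the parsing function shipped in ./tape/data_utils/.

```python
import math
from typing import List, Sequence


def contact_map(
    ca_positions: Sequence[Sequence[float]],
    threshold: float = 8.0,  # assumed cutoff in angstroms, not taken from this README
) -> List[List[bool]]:
    """Return a symmetric L x L matrix whose entry (i, j) is True when the
    alpha-carbon atoms of residues i and j are closer than `threshold`."""
    n = len(ca_positions)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            close = math.dist(ca_positions[i], ca_positions[j]) < threshold
            contacts[i][j] = close
            contacts[j][i] = close
    return contacts


# Toy example: three residues placed along the x axis (hypothetical coordinates).
toy_positions = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (12.0, 0.0, 0.0)]
toy_contacts = contact_map(toy_positions)
```

For real proteins a vectorized implementation would likely be preferable, since the pure-Python loop above scales quadratically with sequence length.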

Pretraining Corpus (Pfam) | Secondary Structure | Contact (ProteinNet) | Remote Homology | Fluorescence | Stability

Pretrained Models

We provide weights for all models pretrained as detailed in the paper. Each set of weights comes as an h5 file of roughly 100 MB. To download all models, run download_pretrained_models.sh. We also provide links to each individual model's weights below:

LSTM | Transformer | ResNet | UniRep (mLSTM) | Bepler Only Unsupervised | Bepler Unsupervised + MultiTask

UniRep is described in Alley et al. Bepler refers to the models in Bepler et al.

Code Setup

We recommend installing tape into a Python virtual environment using:

$ pip install -e .

Usage

tape uses Sacred to configure and store logging information.

Sacred options are specified by running tape with <args>. For example, to run the transformer model on the masked_language_modeling task, run:

$ tape with model=transformer tasks=masked_language_modeling

Additional arguments can be specified by adding e.g. transformer.n_layers=6, training.learning_rate=1e-4, gpu.device=0,1,2, etc.

Global arguments are defined under @tape.config in tape/__main__.py. Model-specific arguments (e.g. transformer.n_layers) can be found in the corresponding model file (e.g. tape/models/Transformer.py).
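
When launching many runs from a script, the command line described above can be assembled programmatically. The following is a minimal sketch; build_tape_command is a hypothetical helper that is not part of tape, and it only formats arguments in the key=value form shown above without validating them.

```python
import shlex
from typing import Dict, List, Optional


def build_tape_command(
    model: str,
    tasks: str,
    overrides: Optional[Dict[str, object]] = None,
) -> List[str]:
    """Assemble the argument list for a `tape with ...` invocation."""
    argv = ["tape", "with", f"model={model}", f"tasks={tasks}"]
    for key, value in (overrides or {}).items():
        argv.append(f"{key}={value}")
    return argv


command = build_tape_command(
    "transformer",
    "masked_language_modeling",
    # The learning rate is passed as a string so it is written exactly as given.
    {"transformer.n_layers": 6, "training.learning_rate": "1e-4"},
)
print(shlex.join(command))
# To launch the run (requires tape to be installed):
# subprocess.run(command, check=True)
```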

Loading a Model

There are two ways of loading a model, depending on whether you want to load the unsupervised pre-training weights or the supervised task-specific weights. Loading unsupervised weights is done by passing the argument load_from=</path/to/unsupervised_weights.h5>. Loading supervised weights is done by passing the argument load_task_from=</path/to/supervised_weights.h5>.

Saving Results

Results will be stored in results/. Each run will be placed in a timestamped directory. All tape sources will automatically be saved, along with the config and per-epoch metrics.

Running the trained Task Model

Once you've trained your task model, you can run an evaluation step like this, passing your test set (as tfrecords) to --datafile:

$ tape-eval results/<task-name>-<model>-<time-stamp>/ --datafile data/remote_homology/remote_homology_test_fold_holdout.tfrecord

This will report the key accuracy metric on your dataset and save the model outputs to results/<task-name>-<model>-<time-stamp>/outputs.pkl for more detailed analysis.
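
For further analysis, the saved outputs can be read back with Python's pickle module. This is a minimal sketch: load_outputs is a hypothetical helper, and the structure of the unpickled object is not documented here, so inspect it before relying on any particular layout. Only unpickle files you produced yourself, since pickle can execute arbitrary code on load.

```python
import pickle
from pathlib import Path
from typing import Any


def load_outputs(run_dir: str) -> Any:
    """Load outputs.pkl from a results/<task-name>-<model>-<time-stamp>/ directory."""
    path = Path(run_dir) / "outputs.pkl"
    with path.open("rb") as handle:
        return pickle.load(handle)


# Example usage (the run directory placeholder is kept as written above):
# outputs = load_outputs("results/<task-name>-<model>-<time-stamp>")
# print(type(outputs))
```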

List of Models and Tasks

The available models are:

  • transformer
  • resnet
  • lstm
  • bepler
  • unirep
  • one_hot
  • one_hot_evolutionary

The available standard tasks are:

  • contact_map
  • stability
  • fluorescence
  • language_modeling
  • masked_language_modeling
  • remote_homology
  • secondary_structure

Additionally, we have some UniRep- and Bepler-specific tasks:

  • bepler_language_modeling: for unsupervised pre-training of the Bepler model.
  • unidirectional_language_modeling: for unsupervised pre-training of the UniRep model.

Finally, we also provide the netsurf task, which performs the full multi-task Netsurf training described in the original paper. It uses the same dataset as the secondary_structure task.

The available models and tasks can be found in tape/models/ModelBuilder.py and tape/tasks/TaskBuilder.py.

Extracting Embeddings

If you would just like the embeddings for a list of proteins, run:

$ tape-embed <filename>.fasta --model <model> --load-from <pretrained-weights>.h5

You can also start from TFRecords by first converting your FASTA file:

$ tape-serialize <filename>.fasta

This creates a new serialized file, <filename>.tfrecord, in the same directory. You can then extract the embeddings with:

$ tape-embed <filename>.tfrecord --model <model> --load-from <pretrained-weights>.h5

This creates a file outputs.pkl in your current directory containing the list of embeddings.
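
tape-embed and tape-serialize both take a FASTA file as input. If your proteins are held in memory, a minimal sketch for writing them out is shown below. The helper write_fasta and the identifiers and sequences in the usage comment are hypothetical, and the 60-character line wrap is a common convention rather than a documented requirement of tape.

```python
from typing import Dict


def write_fasta(sequences: Dict[str, str], path: str, width: int = 60) -> None:
    """Write {identifier: amino-acid sequence} pairs to `path` in FASTA format."""
    with open(path, "w") as handle:
        for identifier, sequence in sequences.items():
            handle.write(f">{identifier}\n")
            # Wrap each sequence into lines of at most `width` characters.
            for start in range(0, len(sequence), width):
                handle.write(sequence[start:start + width] + "\n")


# Example usage (hypothetical identifiers and sequences):
# write_fasta({"protein_a": "MKTAYIAKQR", "protein_b": "GSHMLEDPVA"}, "proteins.fasta")
```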

Extending Tape

It is possible to extend tape directly. Sacred, while great for storing and logging results, is unfortunately a bit of a pain when it comes to modularity and extensibility. That said, adding a Keras model to the training framework is reasonably straightforward. The examples/ directory contains two files showing how to add a model and how to add a model with hyperparameters. If a more specific example would be helpful, please open an issue and we will try to add it if we have time.

Leaderboard

We will soon have a leaderboard available for tracking progress on the five core TAPE tasks, so check back here for a link. See the main tables in our paper for a sense of where performance currently stands. Publication on the leaderboard will be contingent on meeting the citation guidelines below.

In the meantime, here's a temporary leaderboard for each task. All reported models on this leaderboard use unsupervised pretraining.

Secondary Structure

Ranking | Model | Accuracy (3-class)
1. | One Hot + Alignment | 0.80
2. | LSTM | 0.75
2. | ResNet | 0.75
4. | Transformer | 0.73
4. | Bepler | 0.73
4. | Unirep | 0.73
7. | One Hot | 0.69

Contact Prediction

Ranking | Model | L/5 Medium + Long Range
1. | One Hot + Alignment | 0.64
2. | Bepler | 0.40
3. | LSTM | 0.39
4. | Transformer | 0.36
5. | Unirep | 0.34
6. | ResNet | 0.29
6. | One Hot | 0.29

Remote Homology Detection

Ranking | Model | Top 1 Accuracy
1. | LSTM | 0.26
2. | Unirep | 0.23
3. | Transformer | 0.21
4. | Bepler | 0.17
4. | ResNet | 0.17
6. | One Hot + Alignment | 0.09
6. | One Hot | 0.09

Fluorescence

Ranking | Model | Spearman's rho
1. | Transformer | 0.68
2. | LSTM | 0.67
2. | Unirep | 0.67
4. | Bepler | 0.33
5. | ResNet | 0.21
6. | One Hot | 0.14

Stability

Ranking | Model | Spearman's rho
1. | Transformer | 0.73
1. | Unirep | 0.73
1. | ResNet | 0.73
4. | LSTM | 0.69
5. | Bepler | 0.64
6. | One Hot | 0.19

Citation Guidelines

If you find TAPE useful, please cite our corresponding paper. Additionally, anyone using the datasets provided in TAPE must describe and cite all dataset components they use. Producing these data is time- and resource-intensive, and we insist this be recognized by all TAPE users. For convenience, data_refs.bib contains all necessary citations. We also provide each individual citation below.

TAPE (Our paper):

@article{
author = {Rao, Roshan and Bhattacharya, Nicholas and Thomas, Neil and Duan, Yan and Chen, Xi and Canny, John and Abbeel, Pieter and Song, Yun S},
title = {Evaluating Protein Transfer Learning with TAPE},
year = {2019}
}

Pfam (Pretraining):

@article{
author = {El-Gebali, Sara and Mistry, Jaina and Bateman, Alex and Eddy, Sean R and Luciani, Aur{\'{e}}lien and Potter, Simon C and Qureshi, Matloob and Richardson, Lorna J and Salazar, Gustavo A and Smart, Alfredo and Sonnhammer, Erik L L and Hirsh, Layla and Paladin, Lisanna and Piovesan, Damiano and Tosatto, Silvio C E and Finn, Robert D},
doi = {10.1093/nar/gky995},
issn = {0305-1048},
journal = {Nucleic Acids Research},
keywords = {community,protein domains,tandem repeat sequences},
number = {D1},
pages = {D427--D432},
publisher = {Narnia},
title = {{The Pfam protein families database in 2019}},
url = {https://academic.oup.com/nar/article/47/D1/D427/5144153},
volume = {47},
year = {2019}
}

SCOPe: (Remote Homology and Contact)

@article{
  title={SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures},
  author={Fox, Naomi K and Brenner, Steven E and Chandonia, John-Marc},
  journal={Nucleic acids research},
  volume={42},
  number={D1},
  pages={D304--D309},
  year={2013},
  publisher={Oxford University Press}
}

PDB: (Secondary Structure and Contact)

@article{
  title={The protein data bank},
  author={Berman, Helen M and Westbrook, John and Feng, Zukang and Gilliland, Gary and Bhat, Talapady N and Weissig, Helge and Shindyalov, Ilya N and Bourne, Philip E},
  journal={Nucleic acids research},
  volume={28},
  number={1},
  pages={235--242},
  year={2000},
  publisher={Oxford University Press}
}

CASP12: (Secondary Structure and Contact)

@article{
author = {Moult, John and Fidelis, Krzysztof and Kryshtafovych, Andriy and Schwede, Torsten and Tramontano, Anna},
doi = {10.1002/prot.25415},
issn = {08873585},
journal = {Proteins: Structure, Function, and Bioinformatics},
keywords = {CASP,community wide experiment,protein structure prediction},
pages = {7--15},
publisher = {John Wiley {\&} Sons, Ltd},
title = {{Critical assessment of methods of protein structure prediction (CASP)-Round XII}},
url = {http://doi.wiley.com/10.1002/prot.25415},
volume = {86},
year = {2018}
}

NetSurfP2.0: (Secondary Structure)

@article{netsurfp,
  title={NetSurfP-2.0: Improved prediction of protein structural features by integrated deep learning},
  author={Klausen, Michael Schantz and Jespersen, Martin Closter and Nielsen, Henrik and Jensen, Kamilla Kjaergaard and Jurtz, Vanessa Isabell and Soenderby, Casper Kaae and Sommer, Morten Otto Alexander and Winther, Ole and Nielsen, Morten and Petersen, Bent and others},
  journal={Proteins: Structure, Function, and Bioinformatics},
  year={2019},
  publisher={Wiley Online Library}
}

ProteinNet: (Contact)

@article{
  title={ProteinNet: a standardized data set for machine learning of protein structure},
  author={AlQuraishi, Mohammed},
  journal={arXiv preprint arXiv:1902.00249},
  year={2019}
}

Fluorescence:

@article{
  title={Local fitness landscape of the green fluorescent protein},
  author={Sarkisyan, Karen S and Bolotin, Dmitry A and Meer, Margarita V and Usmanova, Dinara R and Mishin, Alexander S and Sharonov, George V and Ivankov, Dmitry N and Bozhanova, Nina G and Baranov, Mikhail S and Soylemez, Onuralp and others},
  journal={Nature},
  volume={533},
  number={7603},
  pages={397},
  year={2016},
  publisher={Nature Publishing Group}
}

Stability:

@article{
  title={Global analysis of protein folding using massively parallel design, synthesis, and testing},
  author={Rocklin, Gabriel J and Chidyausiku, Tamuka M and Goreshnik, Inna and Ford, Alex and Houliston, Scott and Lemak, Alexander and Carter, Lauren and Ravichandran, Rashmi and Mulligan, Vikram K and Chevalier, Aaron and others},
  journal={Science},
  volume={357},
  number={6347},
  pages={168--175},
  year={2017},
  publisher={American Association for the Advancement of Science}
}