
PaccMann / paccmann_proteomics

License: MIT
PaccMann models for protein language modeling

Programming Languages

python
139335 projects - #7 most used programming language
shell
77523 projects

Projects that are alternatives to or similar to paccmann_proteomics

TRAR-VQA
[ICCV 2021] TRAR: Routing the Attention Spans in Transformers for Visual Question Answering -- Official Implementation
Stars: ✭ 49 (+75%)
Mutual labels:  transformer
KitanaQA
KitanaQA: Adversarial training and data augmentation for neural question-answering models
Stars: ✭ 58 (+107.14%)
Mutual labels:  transformer
CrabNet
Predict materials properties using only the composition information!
Stars: ✭ 57 (+103.57%)
Mutual labels:  transformer
FasterTransformer
Transformer related optimization, including BERT, GPT
Stars: ✭ 1,571 (+5510.71%)
Mutual labels:  transformer
tensorflow-ml-nlp-tf2
Hands-on material for "Natural Language Processing with TensorFlow 2 and Machine Learning (from logistic regression to BERT and GPT-3)"
Stars: ✭ 245 (+775%)
Mutual labels:  transformer
learningspoons
nlp lecture-notes and source code
Stars: ✭ 29 (+3.57%)
Mutual labels:  transformer
ProteinLM
Protein Language Model
Stars: ✭ 76 (+171.43%)
Mutual labels:  protein-language-model
ICON
(TPAMI2022) Salient Object Detection via Integrity Learning.
Stars: ✭ 125 (+346.43%)
Mutual labels:  transformer
h-transformer-1d
Implementation of H-Transformer-1D, Hierarchical Attention for Sequence Learning
Stars: ✭ 121 (+332.14%)
Mutual labels:  transformer
semantic-segmentation
SOTA Semantic Segmentation Models in PyTorch
Stars: ✭ 464 (+1557.14%)
Mutual labels:  transformer
DolboNet
A Russian-language Discord chatbot built on the Transformer architecture
Stars: ✭ 53 (+89.29%)
Mutual labels:  transformer
german-sentiment
A data set and model for german sentiment classification.
Stars: ✭ 37 (+32.14%)
Mutual labels:  transformer
sticker2
Further developed as SyntaxDot: https://github.com/tensordot/syntaxdot
Stars: ✭ 14 (-50%)
Mutual labels:  transformer
Highway-Transformer
[ACL'20] Highway Transformer: A Gated Transformer.
Stars: ✭ 26 (-7.14%)
Mutual labels:  transformer
SegFormer
Official PyTorch implementation of SegFormer
Stars: ✭ 1,264 (+4414.29%)
Mutual labels:  transformer
basis-expansions
Basis expansion transformers in sklearn style.
Stars: ✭ 74 (+164.29%)
Mutual labels:  transformer
Graphormer
Graphormer is a deep learning package that allows researchers and developers to train custom models for molecule modeling tasks. It aims to accelerate the research and application in AI for molecule science, such as material design, drug discovery, etc.
Stars: ✭ 1,194 (+4164.29%)
Mutual labels:  transformer
Kevinpro-NLP-demo
All the NLP you need here: personal implementations of fun NLP demos, currently including PyTorch implementations of 13 NLP applications
Stars: ✭ 117 (+317.86%)
Mutual labels:  transformer
NLP-paper
🎨 NLP (natural language processing) tutorial 🎨 https://dataxujing.github.io/NLP-paper/
Stars: ✭ 23 (-17.86%)
Mutual labels:  transformer
TDRG
Transformer-based Dual Relation Graph for Multi-label Image Recognition. ICCV 2021
Stars: ✭ 32 (+14.29%)
Mutual labels:  transformer

PaccMann Proteomics

PaccMann Protein Language Modeling for Protein Classification, Protein-Protein Binding, and Protein Sequence Annotation Tasks.

Life science practitioners are drowning in unlabeled protein sequences. The Natural Language Processing (NLP) community has recently embraced self-supervised learning as a powerful approach to learning representations from unlabeled text, in large part thanks to attention-based, context-aware Transformer models. In a transfer learning fashion, these expensive-to-train universal embeddings can then be rapidly fine-tuned for multiple downstream prediction tasks.

In this work, we present a modified RoBERTa model that is pre-trained with the Masked Language Modeling (MLM) objective on a mixture of binding and non-binding protein sequences (from the STRING database).

Next, we compress protein sequences by 64% with a Byte Pair Encoding (BPE) vocabulary consisting of 10K tokens, each 3-4 amino acids long.
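
To make the tokenization step concrete, here is a minimal sketch of training such a BPE vocabulary with the HuggingFace tokenizers library; the corpus file name is a hypothetical placeholder, and the exact tokenizer class and training settings used in this work may differ:

from tokenizers import ByteLevelBPETokenizer

# Train a 10k-token BPE vocabulary on a one-sequence-per-line protein corpus
# (the file name is a placeholder, not a path from this repository).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["protein_sequences.txt"],
    vocab_size=10_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Each learned token typically spans several amino acids, which is what
# shortens the tokenized sequences relative to character-level input.
encoded = tokenizer.encode("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(encoded.tokens)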

Finally, to expand the model input space to even larger proteins and multi-protein assemblies, we pre-train Longformer models that support 2,048 tokens.
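
As an illustration of this setting, the sketch below instantiates a Longformer for masked language modeling with a 2,048-token window over the 10k BPE vocabulary; the attention window and the extra positions reserved for special tokens are assumptions, not the exact hyperparameters used here:

from transformers import LongformerConfig, LongformerForMaskedLM

# Assumed configuration: 2,048-token inputs over a 10k BPE vocabulary.
config = LongformerConfig(
    vocab_size=10_000,
    max_position_embeddings=2050,  # 2,048 tokens plus room for special tokens (assumption)
    attention_window=512,
)
model = LongformerForMaskedLM(config)
print(model.num_parameters())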

Our approach yields excellent fine-tuning results on protein-protein binding prediction, TCR-epitope binding prediction, cellular localization, and remote homology classification tasks. We suggest that the Transformer's attention mechanism contributes to protein binding site discovery.

Installation

Use conda:

conda env create -f conda.yml

Activate the environment:

conda activate paccmann_proteomics

You are good to go; paccmann_proteomics is installed in editable mode for development:

import paccmann_proteomics

Model Architecture

[Figure: model architecture]

An exemplary RoBERTa architecture is pre-trained on a mixture of binding and non-binding protein sequences, using only the MLM objective. Byte pair encoding with a 10k-token vocabulary enables inputting 64% longer protein sequences compared to character-level embeddings. E_i and T_i represent the input and contextual embeddings for token i. [CLS] is a special token for classification-task output, while [SEP] separates two non-consecutive sequences.
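
For illustration, the sketch below shows how a pair of protein sequences could be packed into a single input; the checkpoint path is a hypothetical placeholder, and RoBERTa tokenizers emit <s>/</s>, which play the roles of the [CLS] and [SEP] tokens described above:

from transformers import RobertaTokenizerFast

# Hypothetical local checkpoint directory (e.g. downloaded from the Box link below).
tokenizer = RobertaTokenizerFast.from_pretrained("path/to/pretrained_checkpoint")

# Encode two sequences as one example: <s> seq1 </s></s> seq2 </s>
pair = tokenizer("MKTAYIAKQRQ", "GSHMLEDPVEN")
print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))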

Training Scripts

Scripts for model training and evaluation described in the preprint can be found here. Related configuration files can be found here.

Launch the language modeling pre-training script with a bash command from the scripts directory:

bash run_language_modeling_script.sh (adjust the paths to your data and to already pre-trained model checkpoints, if necessary)

Launch a sequence-level fine-tuning task with:

bash run_seq_clf_script.sh

Launch a token-level classification/annotation fine-tuning task with:

bash run_token_clf_script.sh
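
The shell scripts above wrap Python training code. As a rough, self-contained illustration of what a sequence-level fine-tuning run boils down to (this is not the repository's actual entry point, and toy random data stands in for real protein sequences and a real pre-trained checkpoint):

import torch
from torch.utils.data import Dataset
from transformers import (
    RobertaConfig,
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Toy dataset of random token ids standing in for BPE-tokenized protein sequences.
class ToyProteinDataset(Dataset):
    def __init__(self, n=32, seq_len=64, vocab_size=10_000):
        self.ids = torch.randint(0, vocab_size, (n, seq_len))
        self.labels = torch.randint(0, 2, (n,))

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return {
            "input_ids": self.ids[i],
            "attention_mask": torch.ones_like(self.ids[i]),
            "labels": self.labels[i],
        }

# A small RoBERTa classifier; in practice the encoder weights would come from a
# pre-trained checkpoint rather than random initialization.
config = RobertaConfig(vocab_size=10_000, num_labels=2)
model = RobertaForSequenceClassification(config)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="toy_clf_out", num_train_epochs=1, per_device_train_batch_size=8),
    train_dataset=ToyProteinDataset(),
)
trainer.train()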

Pre-trained models and Data

Pre-trained models and prepared datasets are available at https://ibm.ent.box.com/v/paccmann-proteomics-data. See data/pretraining and data/fine_tuning for the model pre-training datasets (SwissProt, Pfam, STRING) and the data for the model fine-tuning tasks (localization, solubility, PPI, etc.). Trained Byte Pair Encoding tokenizers are available at data/tokenization.
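
Once a checkpoint has been downloaded, it should be loadable with the standard transformers API. A hedged sketch, where the local directory name is a placeholder rather than an actual folder name from the Box share:

from transformers import RobertaForMaskedLM, RobertaTokenizerFast

# "paccmann-proteomics-checkpoint" is a placeholder for a directory downloaded
# from the Box link above, containing the model weights and BPE tokenizer files.
tokenizer = RobertaTokenizerFast.from_pretrained("paccmann-proteomics-checkpoint")
model = RobertaForMaskedLM.from_pretrained("paccmann-proteomics-checkpoint")

inputs = tokenizer("MKTAYIAKQRQISFVKSHFSRQ", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, sequence length, vocabulary size)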

Contact us if you have further questions 😃.

Preprint

Our preprint was accepted to the Machine Learning for Structural Biology (MLSB) workshop at NeurIPS 2020 and can be found here.
