
THUDM / ProteinLM

License: Apache-2.0
Protein Language Model

Programming Languages

Python
139335 projects - #7 most used programming language
C++
36643 projects - #6 most used programming language
Shell
77523 projects
Cuda
1817 projects
TeX
3793 projects
C
50402 projects - #5 most used programming language
Makefile
30231 projects

Projects that are alternatives of or similar to ProteinLM

finetuner
Finetuning any DNN for better embedding on neural search tasks
Stars: ✭ 442 (+481.58%)
Mutual labels:  transfer-learning, pretrained-models
sparsezoo
Neural network model repository for highly sparse and sparse-quantized models with matching sparsification recipes
Stars: ✭ 264 (+247.37%)
Mutual labels:  transfer-learning, pretrained-models
super-gradients
Easily train or fine-tune SOTA computer vision models with one open source training library
Stars: ✭ 429 (+464.47%)
Mutual labels:  transfer-learning, pretrained-models
ObjectNet
PyTorch implementation of "Pyramid Scene Parsing Network".
Stars: ✭ 15 (-80.26%)
Mutual labels:  transfer-learning, pretrained-models
Bert Keras
Keras implementation of BERT with pre-trained weights
Stars: ✭ 820 (+978.95%)
Mutual labels:  transfer-learning, pretrained-models
Imagenet
Pytorch Imagenet Models Example + Transfer Learning (and fine-tuning)
Stars: ✭ 134 (+76.32%)
Mutual labels:  transfer-learning, pretrained-models
Farm
🏡 Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
Stars: ✭ 1,140 (+1400%)
Mutual labels:  transfer-learning, pretrained-models
Open-Source-Models
Address book for computer vision models.
Stars: ✭ 30 (-60.53%)
Mutual labels:  transfer-learning, pretrained-models
gap-text2sql
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training
Stars: ✭ 83 (+9.21%)
Mutual labels:  pretrained-models
transfer-learning-text-tf
Tensorflow implementation of Semi-supervised Sequence Learning (https://arxiv.org/abs/1511.01432)
Stars: ✭ 82 (+7.89%)
Mutual labels:  transfer-learning
ulm-basenet
Implementation of ULMFit algorithm for text classification via transfer learning
Stars: ✭ 94 (+23.68%)
Mutual labels:  transfer-learning
pytorch cnn trainer
A Simple but Powerful CNN Trainer For PyTorch
Stars: ✭ 26 (-65.79%)
Mutual labels:  transfer-learning
Land-Cover-Classification-using-Sentinel-2-Dataset
Application of deep learning to satellite imagery from the Sentinel-2 satellite, which has orbited the Earth since June 2015. These image patches can be trained and classified using transfer-learning techniques.
Stars: ✭ 36 (-52.63%)
Mutual labels:  transfer-learning
brand-sentiment-analysis
Scripts utilizing Heartex platform to build brand sentiment analysis from the news
Stars: ✭ 21 (-72.37%)
Mutual labels:  transfer-learning
image-background-remove-tool
✂️ Automated high-quality background removal framework for an image using neural networks. ✂️
Stars: ✭ 767 (+909.21%)
Mutual labels:  transfer-learning
Skin Lesions Classification DCNNs
Transfer Learning with DCNNs (DenseNet, Inception V3, Inception-ResNet V2, VGG16) for skin lesions classification
Stars: ✭ 47 (-38.16%)
Mutual labels:  transfer-learning
Warehouse Robot Path Planning
A multi-agent path-planning solution for a warehouse scenario using Q-learning and transfer learning. 🤖️
Stars: ✭ 59 (-22.37%)
Mutual labels:  transfer-learning
neuro-evolution
A project on improving neural network performance using genetic algorithms.
Stars: ✭ 25 (-67.11%)
Mutual labels:  transfer-learning
SimPLE
Code for the paper: "SimPLE: Similar Pseudo Label Exploitation for Semi-Supervised Classification"
Stars: ✭ 50 (-34.21%)
Mutual labels:  transfer-learning
cups-rl
Customisable Unified Physical Simulations (CUPS) for Reinforcement Learning. Experiments run on the ai2thor environment (http://ai2thor.allenai.org/) e.g. using A3C, RainbowDQN and A3C_GA (Gated Attention multi-modal fusion) for Task-Oriented Language Grounding (tasks specified by natural language instructions) e.g. "Pick up the Cup or else"
Stars: ✭ 38 (-50%)
Mutual labels:  transfer-learning

ProteinLM

We pretrain a protein language model based on the Megatron-LM framework and evaluate it on TAPE (Tasks Assessing Protein Embeddings), a benchmark of five biologically relevant semi-supervised learning tasks. Our pretrained model achieves good performance on these tasks.

Overview

Pre-trained models such as BERT have greatly advanced natural language processing by improving the performance of language models. Inspired by the similarity between amino acid sequences and text sequences, we apply language model pre-training to biological data.
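
To make the analogy concrete, below is a minimal, illustrative sketch of per-residue tokenization with a BERT-style masked-token objective. This is not the repository's actual tokenizer; the 20-residue vocabulary, the special tokens, and the 15% mask rate are assumptions borrowed from common BERT-style setups.

```python
# Illustrative sketch only (not ProteinLM's code): treat a protein sequence as
# "text" by tokenizing each amino acid as one token and masking a fraction of
# positions for a masked-language-model objective.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                      # 20 standard residues (assumption)
SPECIALS = ["[PAD]", "[CLS]", "[SEP]", "[MASK]"]
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + list(AMINO_ACIDS))}

def mask_sequence(seq: str, mask_prob: float = 0.15):
    """Tokenize per residue, then mask positions for MLM-style training."""
    tokens = ["[CLS]"] + list(seq) + ["[SEP]"]
    input_ids = [VOCAB[t] for t in tokens]
    labels = [-100] * len(input_ids)                       # -100 = position ignored by the loss
    for pos in range(1, len(tokens) - 1):                  # never mask [CLS] / [SEP]
        if random.random() < mask_prob:
            labels[pos] = input_ids[pos]                   # target: the original residue
            input_ids[pos] = VOCAB["[MASK]"]
    return input_ids, labels

ids, labels = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(ids[:10], labels[:10])
```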

Guidance

We provide pretraining and finetuning code in two separate folders. If you use the pretrained model we provide, simply download the checkpoint and follow the finetune guide. If you want to pretrain your own model, refer to the pretrain guide.

Download ProteinLM

ProteinLM (200M)

For the pretrained model with 200 million parameters, you can download the model checkpoint via GoogleDrive or TsinghuaCloud.

ProteinLM (3B)

For the pretrained model with 3 billion parameters, you can download the model checkpoint from here.

Project Structure

.
├── pretrain                (protein language model pretrain)
│   ├── megatron            (model folder)
│   ├── pretrain_tools      (multi-node pretrain)
│   └── protein_tools       (data preprocessing scripts)
└── tape
    ├── conda_env           (conda env in yaml format)
    ├── converter           (converter script and model config files)
    ├── scripts             (model generator, finetune)
    └── tape                (tape model)

Usage

As the structure above shows, the workflow has two stages:

  • Pretrain
    • Prepare dataset (PFAM)
    • Preprocess data
    • Pretrain
  • Finetune
    • Convert the pretrained protein model checkpoint
    • Finetune on downstream tasks

Detailed explanations are given in each folder's README; a rough end-to-end sketch is also shown below.
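
The sketch below strings the two stages together for orientation only. The script paths and flags are hypothetical placeholders; the actual entry points and arguments are documented in the pretrain and tape folder READMEs.

```python
# Hypothetical orchestration sketch -- the file names and flags below are
# placeholders, not the repository's real entry points.
import subprocess

steps = [
    # Pretrain stage: preprocess the PFAM corpus, then launch Megatron-LM pretraining.
    ["python", "pretrain/protein_tools/preprocess_data.py", "--input", "pfam.fasta"],
    ["bash", "pretrain/pretrain_tools/run_pretrain.sh"],
    # Finetune stage: convert the Megatron checkpoint to TAPE format, then finetune.
    ["python", "tape/converter/convert_checkpoint.py", "--checkpoint", "ckpt/"],
    ["bash", "tape/scripts/finetune.sh", "--task", "secondary_structure"],
]

for cmd in steps:
    subprocess.run(cmd, check=True)   # stop the pipeline if any step fails
```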

Downstream Tasks Performance

| Task                | Metric             | TAPE | ProteinLM (200M) | ProteinLM (3B) |
|---------------------|--------------------|------|------------------|----------------|
| Contact prediction  | P@L/5              | 0.36 | 0.52             | 0.75           |
| Remote homology     | Top-1 accuracy     | 0.21 | 0.26             | 0.30           |
| Secondary structure | Accuracy (3-class) | 0.73 | 0.75             | 0.79           |
| Fluorescence        | Spearman's rho     | 0.68 | 0.68             | 0.68           |
| Stability           | Spearman's rho     | 0.73 | 0.77             | 0.79           |
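
For reference, P@L/5 (the contact-prediction metric above) is the precision of the top L/5 predicted residue pairs for a protein of length L. A minimal sketch follows; the minimum sequence-separation filter is an assumption in the spirit of the TAPE evaluation, not necessarily the exact setting used here.

```python
# Illustrative sketch (not from the ProteinLM repo): precision at L/5 for contact
# prediction. `probs` is an LxL matrix of predicted contact probabilities and
# `contacts` is an LxL 0/1 matrix of true contacts.
import numpy as np

def precision_at_l5(probs: np.ndarray, contacts: np.ndarray, min_sep: int = 6) -> float:
    L = probs.shape[0]
    i, j = np.triu_indices(L, k=min_sep)          # candidate pairs with j - i >= min_sep
    order = np.argsort(probs[i, j])[::-1]         # rank pairs by predicted probability
    top_k = order[: max(1, L // 5)]               # keep the top L/5 predictions
    return float(contacts[i[top_k], j[top_k]].mean())

# Toy example: random scores against a random symmetric contact map.
rng = np.random.default_rng(0)
L = 100
probs = rng.random((L, L))
upper = (rng.random((L, L)) < 0.05).astype(int)
contacts = np.triu(upper, 1) + np.triu(upper, 1).T
print(precision_at_l5(probs, contacts))
```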

Citation

Please cite our paper if you find our work useful for your research. Our paper can be accessed here.

@article{DBLP:journals/corr/abs-2108-07435,
  author    = {Yijia Xiao and
               Jiezhong Qiu and
               Ziang Li and
               Chang{-}Yu Hsieh and
               Jie Tang},
  title     = {Modeling Protein Using Large-scale Pretrain Language Model},
  journal   = {CoRR},
  volume    = {abs/2108.07435},
  year      = {2021},
  url       = {https://arxiv.org/abs/2108.07435},
  eprinttype = {arXiv},
  eprint    = {2108.07435},
  timestamp = {Fri, 20 Aug 2021 13:55:54 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2108-07435.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Contact

If you have any problems using ProteinLM, feel free to contact us at [email protected].

Reference

Our work builds on the following papers, and part of the code is based on Megatron-LM and TAPE.

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

@article{DBLP:journals/corr/abs-1909-08053,
  author    = {Mohammad Shoeybi and
               Mostofa Patwary and
               Raul Puri and
               Patrick LeGresley and
               Jared Casper and
               Bryan Catanzaro},
  title     = {Megatron-LM: Training Multi-Billion Parameter Language Models Using
               Model Parallelism},
  journal   = {CoRR},
  volume    = {abs/1909.08053},
  year      = {2019},
  url       = {http://arxiv.org/abs/1909.08053},
  archivePrefix = {arXiv},
  eprint    = {1909.08053},
  timestamp = {Tue, 24 Sep 2019 11:33:51 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1909-08053.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Evaluating Protein Transfer Learning with TAPE

@article{DBLP:journals/corr/abs-1906-08230,
  author    = {Roshan Rao and
               Nicholas Bhattacharya and
               Neil Thomas and
               Yan Duan and
               Xi Chen and
               John F. Canny and
               Pieter Abbeel and
               Yun S. Song},
  title     = {Evaluating Protein Transfer Learning with {TAPE}},
  journal   = {CoRR},
  volume    = {abs/1906.08230},
  year      = {2019},
  url       = {http://arxiv.org/abs/1906.08230},
  archivePrefix = {arXiv},
  eprint    = {1906.08230},
  timestamp = {Sat, 23 Jan 2021 01:20:25 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1906-08230.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}