flatironinstitute / deepblast

Licence: BSD-3-Clause license
Neural Networks for Protein Sequence Alignment

Programming Languages

python
139335 projects - #7 most used programming language
Jupyter Notebook
11667 projects

Projects that are alternatives of or similar to deepblast

tape-neurips2019
Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. (DEPRECATED)
Stars: ✭ 117 (+303.45%)
Mutual labels:  protein-structure, language-modeling, protein-sequences
Biopython
Official git repository for Biopython (originally converted from CVS)
Stars: ✭ 2,936 (+10024.14%)
Mutual labels:  protein-structure, protein, sequence-alignment
SeqVec
Modelling the Language of Life - Deep Learning Protein Sequences
Stars: ✭ 74 (+155.17%)
Mutual labels:  protein-structure, protein
cbh21-protein-solubility-challenge
Template with code & dataset for the "Structural basis for solubility in protein expression systems" challenge at the Copenhagen Bioinformatics Hackathon 2021.
Stars: ✭ 15 (-48.28%)
Mutual labels:  protein-structure, protein
r3dmol
🧬 An R package for visualizing molecular data in 3D
Stars: ✭ 45 (+55.17%)
Mutual labels:  protein-structure, protein
gcWGAN
Guided Conditional Wasserstein GAN for De Novo Protein Design
Stars: ✭ 38 (+31.03%)
Mutual labels:  protein-structure, protein
lightdock
Protein-protein, protein-peptide and protein-DNA docking framework based on the GSO algorithm
Stars: ✭ 110 (+279.31%)
Mutual labels:  protein-structure, protein
mmtf-spark
Methods for the parallel and distributed analysis and mining of the Protein Data Bank using MMTF and Apache Spark.
Stars: ✭ 20 (-31.03%)
Mutual labels:  protein-structure, protein-sequences
plmc
Inference of couplings in proteins and RNAs from sequence variation
Stars: ✭ 85 (+193.1%)
Mutual labels:  protein-structure, protein-sequences
parapred
Paratope Prediction using Deep Learning
Stars: ✭ 49 (+68.97%)
Mutual labels:  protein-structure, protein-sequences
VSCoding-Sequence
VSCode Extension for interactively visualising protein structure data in the editor
Stars: ✭ 41 (+41.38%)
Mutual labels:  protein-structure, protein
mmterm
View proteins and trajectories in the terminal
Stars: ✭ 87 (+200%)
Mutual labels:  protein-structure, protein
pytorch-rgn
Recurrent Geometric Network in Pytorch
Stars: ✭ 28 (-3.45%)
Mutual labels:  protein-structure, protein-sequences
Jupyter Dock
Jupyter Dock is a set of Jupyter Notebooks for performing molecular docking protocols interactively, as well as visualizing, converting file formats and analyzing the results.
Stars: ✭ 179 (+517.24%)
Mutual labels:  protein-structure, protein
FLIP
A collection of tasks to probe the effectiveness of protein sequence representations in modeling aspects of protein design
Stars: ✭ 35 (+20.69%)
Mutual labels:  protein, protein-sequences
RamaNet
Performs de novo protein design using machine learning and PyRosetta to generate a novel protein structure
Stars: ✭ 41 (+41.38%)
Mutual labels:  protein-structure
pytorch-translm
An implementation of transformer-based language model for sentence rewriting tasks such as summarization, simplification, and grammatical error correction.
Stars: ✭ 22 (-24.14%)
Mutual labels:  language-modeling
SneakySnake
SneakySnake🐍 is the first and the only pre-alignment filtering algorithm that works efficiently and fast on modern CPU, FPGA, and GPU architectures. It greatly (by more than two orders of magnitude) expedites sequence alignment calculation for both short and long reads. Described in the Bioinformatics (2020) by Alser et al. https://arxiv.org/abs…
Stars: ✭ 44 (+51.72%)
Mutual labels:  sequence-alignment
referit3d
Code accompanying our ECCV-2020 paper on 3D Neural Listeners.
Stars: ✭ 59 (+103.45%)
Mutual labels:  language-modeling
pia
📚 🔬 PIA - Protein Inference Algorithms
Stars: ✭ 19 (-34.48%)
Mutual labels:  protein

DeepBLAST

Learning protein structural similarity from sequence alone. Our preprint can be found here

Installation

DeepBLAST can be installed from pip via

pip install deepblast

To install from the development branch run

pip install git+https://github.com/flatironinstitute/deepblast.git

Downloading pretrained models and data

The pretrained DeepBLAST model can be downloaded here.

The TM-align structural alignments used to pretrain DeepBLAST can be found here

See the Malisam and Malidup websites to download their datasets.

Getting started

Two command-line scripts are available: deepblast-train and deepblast-evaluate.

Pretraining

deepblast-train takes as input a tab-delimited file with the columns query_seq_id | key_seq_id | tm_score1 | tm_score2 | rmsd | sequence1 | sequence2 | alignment_string. See an example here of what this looks like. At the moment, only the output of TM-align can be parsed. The parsing script can be found under:

deepblast/dataset/parse_tm_align.py [fname] [output_table]
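
As a rough illustration of the table layout described above, the following sketch reads such a file using only the Python standard library. The column names come from the description above; the function name read_pairs, the absence of a header row, and the example row are assumptions made for illustration, not part of DeepBLAST.

```python
import csv
import io

# Column order as listed in the description of the input table.
COLUMNS = [
    "query_seq_id", "key_seq_id", "tm_score1", "tm_score2",
    "rmsd", "sequence1", "sequence2", "alignment_string",
]
NUMERIC = ("tm_score1", "tm_score2", "rmsd")


def read_pairs(handle):
    """Read a tab-delimited pairs table into a list of dicts.

    Assumption: the file has no header row and exactly eight columns.
    """
    rows = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, fields in enumerate(reader, start=1):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"line {line_no}: expected {len(COLUMNS)} columns, "
                f"got {len(fields)}"
            )
        row = dict(zip(COLUMNS, fields))
        for key in NUMERIC:
            row[key] = float(row[key])
        rows.append(row)
    return rows


# Made-up example row, for illustration only.
example = "q1\tk1\t0.61\t0.58\t2.3\tADQ\tMHQ\t:::\n"
rows = read_pairs(io.StringIO(example))
print(rows[0]["sequence1"], rows[0]["alignment_string"], rows[0]["sequence2"])
```

For a file on disk, the same function can be called as read_pairs(open(path, newline="")).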

Once the data is configured and split appropriately, deepblast-train can be run. The command-line options are given below (see deepblast-train --help for more details).

usage: deepblast-train [-h] [--gpus GPUS] [--grad-accum GRAD_ACCUM] [--grad-clip GRAD_CLIP] [--nodes NODES] [--num-workers NUM_WORKERS] [--precision PRECISION] [--backend BACKEND]
                       [--load-from-checkpoint LOAD_FROM_CHECKPOINT] --train-pairs TRAIN_PAIRS --test-pairs TEST_PAIRS --valid-pairs VALID_PAIRS [--embedding-dim EMBEDDING_DIM]
                       [--rnn-input-dim RNN_INPUT_DIM] [--rnn-dim RNN_DIM] [--layers LAYERS] [--loss LOSS] [--learning-rate LEARNING_RATE] [--batch-size BATCH_SIZE]
                       [--multitask MULTITASK] [--finetune FINETUNE] [--mask-gaps MASK_GAPS] [--scheduler SCHEDULER] [--epochs EPOCHS]
                       [--visualization-fraction VISUALIZATION_FRACTION] -o OUTPUT_DIRECTORY
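
The sketch below shows one possible way to prepare the three required inputs and assemble the command. It uses only the flags marked as required in the usage above (--train-pairs, --valid-pairs, --test-pairs and -o). The helper names, file names and split fractions are hypothetical, and the uniform random split is an assumption: the text above only says the data should be split appropriately, and a random split may place closely related pairs in both the training and test sets.

```python
import random
import shlex


def split_lines(lines, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle table rows and split them into train/valid/test lists.

    Assumption: a plain random split; a split by protein family may be
    more appropriate for real data.
    """
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    n_valid = int(len(lines) * valid_frac)
    n_test = int(len(lines) * test_frac)
    valid = lines[:n_valid]
    test = lines[n_valid:n_valid + n_test]
    train = lines[n_valid + n_test:]
    return train, valid, test


def train_command(train_path, valid_path, test_path, output_dir):
    """Build the argument list using only the required options."""
    return [
        "deepblast-train",
        "--train-pairs", train_path,
        "--valid-pairs", valid_path,
        "--test-pairs", test_path,
        "-o", output_dir,
    ]


# Placeholder rows standing in for lines of the tab-delimited table.
lines = [f"row{i}" for i in range(20)]
train, valid, test = split_lines(lines)

# Hypothetical file and directory names.
cmd = train_command("train.txt", "valid.txt", "test.txt", "results")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the three files have been written.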

Evaluation

This evaluates how well the DeepBLAST predictions agree with the structural alignments. The deepblast-train command automatically evaluates the held-out test set if it runs to completion. However, a separate deepblast-evaluate command is available in case pretraining was interrupted. The command-line options are given below (see deepblast-evaluate --help for more details).

usage: deepblast-evaluate [-h] [--gpus GPUS] [--num-workers NUM_WORKERS] [--nodes NODES] [--load-from-checkpoint LOAD_FROM_CHECKPOINT] [--precision PRECISION] [--backend BACKEND]
                          --train-pairs TRAIN_PAIRS --test-pairs TEST_PAIRS --valid-pairs VALID_PAIRS [--embedding-dim EMBEDDING_DIM] [--rnn-input-dim RNN_INPUT_DIM]
                          [--rnn-dim RNN_DIM] [--layers LAYERS] [--loss LOSS] [--learning-rate LEARNING_RATE] [--batch-size BATCH_SIZE] [--multitask MULTITASK]
                          [--finetune FINETUNE] [--mask-gaps MASK_GAPS] [--scheduler SCHEDULER] [--epochs EPOCHS] [--visualization-fraction VISUALIZATION_FRACTION] -o
                          OUTPUT_DIRECTORY

Loading the models

import torch
from deepblast.trainer import LightningAligner
from deepblast.dataset.utils import pack_sequences
from deepblast.dataset.utils import states2alignment
import matplotlib.pyplot as plt
import seaborn as sns

# Load the pretrained model (your_model_path is the path to the downloaded checkpoint)
model = LightningAligner.load_from_checkpoint(your_model_path)

# Use the GPU when one is available, otherwise stay on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)

# Obtain hard alignment from the raw sequences
x = 'IGKEEIQQRLAQFVDHWKELKQLAAARGQRLEESLEYQQFVANVEEEEAWINEKMTLVASED'
y = 'QQNKELNFKLREKQNEIFELKKIAETLRSKLEKYVDITKKLEDQNLNLQIKISDLEKKLSDA'
pred_alignment = model.align(x, y)
x_aligned, y_aligned = states2alignment(pred_alignment, x, y)
print(x_aligned)
print(pred_alignment)
print(y_aligned)

# Tokenize the raw sequences into integer tensors
x_ = torch.Tensor(model.tokenizer(str.encode(x))).long()
y_ = torch.Tensor(model.tokenizer(str.encode(y))).long()

# Pack sequences for easier parallelization
seq, order = pack_sequences([x_], [y_])
seq = seq.to(device)

# Generate alignment score
score = model.aligner.score(seq, order).item()
print('Score', score)

# Predict expected alignment
A, match_scores, gap_scores = model.forward(seq, order)

# Display the expected alignment
fig, ax = plt.subplots(1, 3, figsize=(9, 3))
sns.heatmap(A.cpu().detach().numpy().squeeze(), ax=ax[0], cbar=False,  cmap='viridis')
sns.heatmap(match_scores.cpu().detach().numpy().squeeze(), ax=ax[1],  cmap='viridis')
sns.heatmap(gap_scores.cpu().detach().numpy().squeeze(), ax=ax[2],  cmap='viridis')
ax[0].set_title('Predicted Alignment')
ax[1].set_title(r'Match scores ($\mu$)')  # raw string so that \m is not treated as an escape
ax[2].set_title('Gap scores ($g$)')
plt.tight_layout()
plt.show()

The output will look like

IGKEEIQQRLAQFVDHWKELKQLAAARGQRLEESLEYQQFVANVEEEEAWINEKMTLVASED
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
QQNKELNFKLREKQNEIFELKKIAETLRSKLEKYVDITKKLEDQNLNLQIKISDLEKKLSDA

Score 282.3163757324219

FAQ

Q : How do I interpret the alignment string?

A : The alignment string marks, for each column of the alignment, whether the two residues are aligned (:) or whether a residue is present in only one of the sequences (1 or 2). For example, consider the following alignment:

ADQSFLWASGVI-S------D-EM--
::::::::::::2:222222:2:122
MHHHHHHSSGVDLWSHPQFEKGT-EN

The first 12 residues in the alignment are matches. The last 2 characters indicate insertions in the second sequence (hence the 2 in the alignment string), and the third-to-last character indicates an insertion in the first sequence (hence the 1 in the alignment string).
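
To make the encoding concrete, here is a minimal sketch that rebuilds the two gapped rows from the ungapped sequences and the alignment string. It is an illustrative re-implementation of the convention described above, not the library's states2alignment; the function name expand_alignment is hypothetical.

```python
def expand_alignment(states, x, y):
    """Rebuild gapped sequences from an alignment string.

    ':' consumes one residue from both sequences (match),
    '1' consumes a residue from x only (gap in y),
    '2' consumes a residue from y only (gap in x).
    """
    xi = yi = 0
    x_row, y_row = [], []
    for state in states:
        if state == ":":
            x_row.append(x[xi])
            y_row.append(y[yi])
            xi += 1
            yi += 1
        elif state == "1":
            x_row.append(x[xi])
            y_row.append("-")
            xi += 1
        elif state == "2":
            x_row.append("-")
            y_row.append(y[yi])
            yi += 1
        else:
            raise ValueError(f"unknown state {state!r}")
    if xi != len(x) or yi != len(y):
        raise ValueError("alignment string does not cover both sequences")
    return "".join(x_row), "".join(y_row)


# The ungapped sequences from the example above.
x = "ADQSFLWASGVISDEM"
y = "MHHHHHHSSGVDLWSHPQFEKGTEN"
states = "::::::::::::2:222222:2:122"

x_aligned, y_aligned = expand_alignment(states, x, y)
print(x_aligned)
print(states)
print(y_aligned)
```

Counting the states gives a quick consistency check: the number of ':' and '1' characters equals the length of the first sequence, and the number of ':' and '2' characters equals the length of the second.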

Citation

If you find our work useful, please cite us at

@article{morton2020protein,
  title={Protein Structural Alignments From Sequence},
  author={Morton, Jamie and Strauss, Charlie and Blackwell, Robert and Berenberg, Daniel and Gligorijevic, Vladimir and Bonneau, Richard},
  journal={bioRxiv},
  year={2020},
  publisher={Cold Spring Harbor Laboratory}
}