DMPfold

Consider using DMPfold2, which is faster, more accurate and easier to install.

Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints.

See our paper in Nature Communications for more details. Please cite the paper if you use DMPfold.

You can also run DMPfold via the PSIPRED web server. This is a good way to get models for a few sequences, but if you want to run DMPfold on many sequences we strongly recommend running it locally. The server version of DMPfold has restrictions on run time and uses parameters that give faster runs, so it should not be used to benchmark DMPfold.

Installation

Because DMPfold makes use of many different pieces of software, installation can be a little fiddly. However, we have aimed to make it as straightforward as possible. These instructions should work on a Linux system:

  • Make sure you have Python 3 with PyTorch 0.4 or later, NumPy and SciPy installed; a quick sanity check is sketched after this list. GPU setup is optional for PyTorch - it won't speed things up much, because running the network is not a time-consuming step. DMPfold has been tested on Python 3.6 and 3.7. The command python3 should point to the Python that you want to use.
  • Install HH-suite and the uniclust30 database, unless you are getting your alignments from elsewhere.
  • Install FreeContact.
  • Install CCMpred.
  • Install MODELLER, which requires a license key. Only the Python package is required so this can be installed with conda install modeller -c salilab.
  • Install CNS. We found we had to follow all of the steps in this comment to get CNS working:
    • As per the documentation, set the CNS_SOLVE environment variable to the appropriate location in both cns_solve_env (for csh, used when building CNS) and .cns_solve_env_sh (for Bash, used when running DMPfold).
    • Set MXFPEPS2 in machvar.inc to 8192.
    • Set MXRTP in rtf.inc in the source directory to 4000, and in machvar.f add WRITE (6,'(I6,E10.3,E10.3)') I, ONEP, ONEM just above line 67, which looks like IF (ONE .EQ. ONEP .OR. ONE .EQ. ONEM) THEN.
    • We also had to install the flex-devel package via our system package manager.
    • Change two values in cns_solve_1.3/modules/nmr/readdata to larger numbers to allow DMPfold to run on larger structures: change the nrestraints = 20000 line to something like nrestraints = 50000, and the nassign 1600 line to something like nassign 3000. A scripted sketch of these edits appears after this list.
    • To build CNS, from csh type source cns_solve_env; make install.
  • Download and patch the required CNS scripts by changing into the cnsfiles directory and running sh installscripts.sh.
  • Install CD-HIT, which is usually as simple as a clone and make (a typical build is sketched after this list). CD-HIT is not required if you don't need to predict the TM-score of generated models.
  • Install the legacy BLAST software, in particular formatdb, blastpgp and makemat. We may update this to BLAST+ in the future.
  • Other software is pre-compiled and included here (PSIPRED, PSICOV and various utility scripts, with the code in src). These should run as provided but may need to be recompiled using the makefile if issues arise. Some other standard programs, such as the csh shell, are assumed to be available.
  • Change lines 10/13-15/18/21/24 in seq2maps.csh, lines 11/14/17/20 in aln2maps.csh, lines 4/7 in bin/runpsipredandsolvwithdb, lines 10/13 in run_dmpfold.sh and lines 7/10 in predict_tmscore.sh to point to the installed locations of the above software. You can also set the number of cores to use in seq2maps.csh and aln2maps.csh; this sets the number of cores for HHblits, PSICOV, FreeContact and CCMpred, and the scripts run faster with this set to a value larger than 1 (e.g. 4 or 8).
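
To sanity-check the Python side of the installation (the first step above), the commands below should all run without errors. This is just a convenience check, not part of DMPfold itself:

    # python3 should be Python 3.6 or 3.7, the tested versions
    python3 --version
    # PyTorch 0.4 or later, plus NumPy and SciPy, must be importable
    python3 -c "import torch; print(torch.__version__)"
    python3 -c "import numpy, scipy; print(numpy.__version__, scipy.__version__)"
    # csh is assumed by several of the wrapper scripts
    command -v csh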
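
The CNS source edits above can mostly be scripted. The sketch below assumes the stock CNS 1.3 layout; the sed patterns for machvar.inc and rtf.inc are guesses at the exact parameter syntax in those files, so verify every change by hand:

    cd cns_solve_1.3
    # Raise the restraint and assignment limits (exact strings as in the notes above)
    sed -i 's/nrestraints = 20000/nrestraints = 50000/' modules/nmr/readdata
    sed -i 's/nassign 1600/nassign 3000/' modules/nmr/readdata
    # Raise MXFPEPS2 and MXRTP - these patterns are assumptions, check the files first
    sed -i 's/MXFPEPS2=[0-9]*/MXFPEPS2=8192/' source/machvar.inc
    sed -i 's/MXRTP=[0-9]*/MXRTP=4000/' source/rtf.inc
    # The WRITE statement in machvar.f is easiest to add by hand, just above line 67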
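
For completeness, a typical CD-HIT build looks like the following (the repository URL is the usual upstream location, stated here as an assumption):

    git clone https://github.com/weizhongli/cdhit.git
    cd cdhit
    make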

Check the continuous integration setup script and logs for additional tips and a step-by-step installation on Ubuntu.

Usage

Here we give an example of running DMPfold on Pfam family PF10963. First you need to generate the .21c and .map files. This can be done in one of two ways:

  • From a single sequence: csh seq2maps.csh example/PF10963.fasta to run HHblits, PSIPRED, SOLVPRED, PSICOV, FreeContact, CCMpred and alnstats.
  • From an alignment: csh aln2maps.csh example/PF10963.aln to run PSIPRED, SOLVPRED, PSICOV, FreeContact, CCMpred and alnstats. The file PF10963.aln has one sequence per line with the ungapped target sequence as the first line; a small made-up example follows this list.
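
For illustration, the first few lines of a .aln file might look like this, using made-up sequences. The first row is the ungapped target; each following row has the same length, with '-' marking gaps (this reading of the format, matching the PSICOV-style alignment convention, is our assumption):

    MKVLLITGAGRGIGLELAKQ
    MKVLLVSGAGRGIGLELAHQ
    -KVLMITGAG-GIGLELARQ
    MRVLLITGSGRGIGLE-AKQ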

Then run sh run_dmpfold.sh example/PF10963.fasta PF10963.21c PF10963.map ./PF10963 to run DMPfold, where the last parameter is an output directory that will be created. Running sh run_dmpfold.sh example/PF10963.fasta PF10963.21c PF10963.map ./PF10963 5 20 instead runs 5 iterations with 20 models per iteration (the defaults are 3 and 50). The final model is final_1.pdb; additional structures may be written as final_2.pdb to final_5.pdb if they are significantly different. Many other files are generated, totalling around 100 MB - these should be deleted to save disk space if you are running DMPfold on many sequences.

To predict the TM-score of a DMPfold model using our trained predictor, run sh predict_tmscore.sh example/PF10963.fasta PF10963.aln PF10963/final_1.pdb PF10963/rawdistpred.1. If this predictor estimates that a model has a TM-score of at least 0.5, there is an 83% chance of this being the case according to cross-validation on the Pfam validation set.
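
Putting the steps together, a complete run from a single sequence might look like the sketch below. It assumes the installation steps are done, the paths in the scripts have been edited, and that seq2maps.csh leaves PF10963.aln, PF10963.21c and PF10963.map in the working directory, as the commands above imply. The cleanup at the end is optional:

    # Generate the alignment, .21c and .map inputs from a single sequence
    csh seq2maps.csh example/PF10963.fasta
    # Fold with the default 3 iterations and 50 models per iteration
    sh run_dmpfold.sh example/PF10963.fasta PF10963.21c PF10963.map ./PF10963
    # Estimate the TM-score of the top-ranked model
    sh predict_tmscore.sh example/PF10963.fasta PF10963.aln PF10963/final_1.pdb PF10963/rawdistpred.1
    # Keep the final models and reclaim the ~100 MB of intermediate files
    cp PF10963/final_*.pdb .
    rm -r PF10963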

See Supplementary Figure 1 in the paper for estimates of run time. It takes around 3 hours on a single core to carry out a complete DMPfold run for a 200-residue protein, though this can occasionally be much longer due to PSICOV not converging. 8 GB of memory is generally sufficient to run DMPfold, but more may be required for larger proteins.

Figure 5 in the paper gives some data on how DMPfold performs with respect to sequence length. Sequences up to around 600 residues in length can be modelled accurately, with performance degrading above this.

Data

Models for the 1,475 Pfam families modelled in the paper can be downloaded here. Additional models for the remainder of the dark Pfam families can be downloaded here (some were not modelled due to small sequence alignments). Models for the Pfam families used for validation can be downloaded here. Alignments for the Pfam families without available templates can be downloaded here. The format is one sequence per line with the ungapped target sequence as the first line.

The directory pfam in this repository contains text files with the lists from Figure 4A of the paper, target sequences for modelled families and data for modelled families (sequence length, effective sequence count, distogram satisfaction scores, estimated TM-score and probability TM-score >= 0.5).

The list of PDB chains used for training can be found here.
