All Projects → debbiemarkslab → plmc

debbiemarkslab / plmc

Licence: MIT license
Inference of couplings in proteins and RNAs from sequence variation

Programming Languages

c
50402 projects - #5 most used programming language
matlab
3953 projects

Projects that are alternatives of or similar to plmc

pytorch-rgn
Recurrent Geometric Network in Pytorch
Stars: ✭ 28 (-67.06%)
Mutual labels:  protein-structure, protein-sequences
mmtf-spark
Methods for the parallel and distributed analysis and mining of the Protein Data Bank using MMTF and Apache Spark.
Stars: ✭ 20 (-76.47%)
Mutual labels:  protein-structure, protein-sequences
tape-neurips2019
Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. (DEPRECATED)
Stars: ✭ 117 (+37.65%)
Mutual labels:  protein-structure, protein-sequences
deepblast
Neural Networks for Protein Sequence Alignment
Stars: ✭ 29 (-65.88%)
Mutual labels:  protein-structure, protein-sequences
parapred
Paratope Prediction using Deep Learning
Stars: ✭ 49 (-42.35%)
Mutual labels:  protein-structure, protein-sequences
gcWGAN
Guided Conditional Wasserstein GAN for De Novo Protein Design
Stars: ✭ 38 (-55.29%)
Mutual labels:  protein-structure
hPDB
PDB parser in Haskell
Stars: ✭ 20 (-76.47%)
Mutual labels:  protein-structure
ProteinSecondaryStructure-CNN
Protein Secondary Structure predictor using Convolutional Neural Networks
Stars: ✭ 82 (-3.53%)
Mutual labels:  protein-structure
DMPfold
De novo protein structure prediction using iteratively predicted structural constraints
Stars: ✭ 52 (-38.82%)
Mutual labels:  protein-structure
r3dmol
🧬 An R package for visualizing molecular data in 3D
Stars: ✭ 45 (-47.06%)
Mutual labels:  protein-structure
DeepCov
Fully convolutional neural networks for protein residue-residue contact prediction
Stars: ✭ 36 (-57.65%)
Mutual labels:  protein-structure
lightdock
Protein-protein, protein-peptide and protein-DNA docking framework based on the GSO algorithm
Stars: ✭ 110 (+29.41%)
Mutual labels:  protein-structure
VSCoding-Sequence
VSCode Extension for interactively visualising protein structure data in the editor
Stars: ✭ 41 (-51.76%)
Mutual labels:  protein-structure
hotspot3d
3D hotspot mutation proximity analysis tool
Stars: ✭ 43 (-49.41%)
Mutual labels:  protein-structure
Biopython
Official git repository for Biopython (originally converted from CVS)
Stars: ✭ 2,936 (+3354.12%)
Mutual labels:  protein-structure
mmtf
The specification of the MMTF format for biological structures
Stars: ✭ 40 (-52.94%)
Mutual labels:  protein-structure
ddpm-proteins
A denoising diffusion probabilistic model (DDPM) tailored for conditional generation of protein distograms
Stars: ✭ 55 (-35.29%)
Mutual labels:  protein-structure
FunFolDesData
Rosetta FunFolDes – a general framework for the computational design of functional proteins.
Stars: ✭ 15 (-82.35%)
Mutual labels:  protein-structure
geometric-vector-perceptron
Implementation of Geometric Vector Perceptron, a simple circuit for 3d rotation equivariance for learning over large biomolecules, in Pytorch. Idea proposed and accepted at ICLR 2021
Stars: ✭ 45 (-47.06%)
Mutual labels:  protein-structure
SeqVec
Modelling the Language of Life - Deep Learning Protein Sequences
Stars: ✭ 74 (-12.94%)
Mutual labels:  protein-structure

plmc

plmc infers undirected graphical models to describe coevolution and covariation in families of biological sequences. With a multiple sequence alignment as an input, plmc can quantify inferred coupling strengths between all pairs of positions (couplingsfile output) or infer a generative model of the sequences for predicting the effects of mutations or designing new sequences (paramfile output).

Usage

  plmc [options] alignmentfile
  plmc -c couplingsfile alignmentfile
  plmc -o paramfile -c couplingsfile alignmentfile
  plmc [-h | --help]
  
Required input:
  alignmentfile                    Multiple sequence alignment in FASTA format

Options, input:
  -w  --weights    weightsfile     Load sequence weights from file (one weight per line)

Options, output:
  -c  --couplings  couplingsfile   Save coupling scores to file (text)
  -o  --output     paramfile       Save estimated parameters to file (binary)
  --save-weights   weightsfile     Save sequence weights to file (text)

Options, alignment processing:
  -s  --scale      <value>         Sequence weights: neighborhood weight [s > 0]
  -t  --theta      <value>         Sequence weights: neighborhood divergence [0 < t < 1]

Options, Maximum a posteriori estimation (L-BFGS, default):
  -lh --lambdah    <value>         Set L2 lambda for fields (h_i)
  -le --lambdae    <value>         Set L2 lambda for couplings (e_ij)
  -lg --lambdag    <value>         Set group L1 lambda for couplings (e_ij)

Options, general:
      --fast                       Fast weights and stochastic gradient descent
  -a  --alphabet   alphabet        Alternative alphabet (default 21: -ACDEFGHIKLMNPQRSTVWY)
  -f  --focus      identifier      Select only uppercase, non-gapped sites from a focus sequence
  -g  --gapignore                  Exclude first alphabet character from potential calculations
  -m  --maxiter                    Maximum number of iterations
  -n  --ncores    [<number>|max]   Maximum number of threads to use in OpenMP
  -h  --help                       Usage

Compilation

plmc requires no external libraries, but can optionally be accelerated with OpenMP for multicore parallelism.

Multicore. To compile with gcc and OpenMP:

make all-openmp

Multicore, macOS. To install OpenMP through homebrew and compile with clang:

brew install libomp
make all-mac-openmp

Single core, Linux. To compile with gcc:

make all

Single core, macOS. To compile with clang:

make all-mac

Single precision. All of the above targets compile to double precision (64 bit), but reducing the precision to single (32 bit) increases speed and decreases memory requirements by approximately a factor of two. The fastest compile settings are then:

make all-openmp32

Output

Coupling scores. The couplingsfile is a flat text file containing scores quantifying the inferred strength of the coupling between every pair of positions. It has 6 columns: RES_I FOCUS_AI RES_J FOCUS_AJ 0 SCORE, where SCORE is the coupling score between positions RES_I and RES_J, FOCUS_AI and FOCUS_AJ are the letters in the focus sequence (optional, - if no focus), 0 is a placeholder. The SCORE values are APC-corrected Frobenius norm scores, but alternative scores can be computed from the raw parameter values.

Parameter estimates. The optional paramfile specified with -o, will store all inferred model parameters in binary. These can get large, as for proteins the model explicitly parameterizes all possible pairs of amino acids at all possible pairs of postions, which is about 106-108 numbers for families of lengths ~70-700. The MATLAB script scripts/read_params.m unpacks this binary file into model parameters as well as associated metadata, such as inferred sequence weights.

Sequence weights. The optional weightsfile specified with --save-weights (and loaded with -w) is a flat text file containing one floating-point weight per line. Each sequence's weight is the inverse of the number of neighboring sequences with less than θ percent divergence.

Examples

Protein alignments. The example directory includes an alignment of the protein dihdyrofolate reductase (DHFR). To infer a model for this family, we can type the following in the base directory:

bin/plmc -o example/protein/DHFR.params -le 16.0 -lh 0.01 -m 100 -g -f DYR_ECOLI example/protein/DHFR.a2m

The numeric options set a strong L2 regularization for the couplings, λe = 16.0, a weak L2 regularization for the sites, λh = 0.01, and the maximum number of iterations at 100. The focus -f option tells plmc to only model columns that are upper-case in the E. coli sequence DYR_ECOLI (NOTE: for focus mode, the alignment should only contain columns that are coding in the focus sequence, otherwise the offsets for the output sequence will be incorrect.) The -g gap-ignoring option ignores gaps by modeling only the coding portions of each sequence. To read the binary paramfile DHFR.eij and visualize the couplings, we can type the following in MATLAB from the scripts directory:

params = read_params('../example/protein/DHFR.params');
plot_coupling_scores(params)

This computes and plots the same (APC-corrected) coupling strengths that would be output to couplingfile, revealing many strongly coupled pairs of positions in the long-term evolution of DHFR:

Each coupling strength summarizes a larger set of coupling parameters, since the model explicitly parameterizes each possible combination of amino acids at each pair of positions. To visualize these sequence-level parameters, we can use a JavaScript tool EVzoom that is designed to visualize plmc models. For efficiency, we export only the strongly coupled pairs (culled by an EM-based outlier-detection method) in a lightweight JSON format with

  export_couplings_json(read_params('../example/protein/DHFR.params'), '../example/protein/DHFR.json')

EVzoom makes it possible to navigate the large number of sequence-level parameters present in undirected models of sequence families:

RNA alignments. An example RNA alignment is included for the SAM riboswitch. To infer the couplings with an RNA alphabet (.ACGU) type the following in the base directory:

bin/plmc -c example/RNA/RF00162.EC -o example/RNA/RF00162.params -a .ACGU -le 20.0 -lh 0.01 -m 50 example/RNA/RF00162.fasta

To plot the aggregated coupling scores type the following in MATLAB from the scripts directory:

params = read_params('../example/RNA/RF00162.params');
plot_coupling_scores(params)

The default alphabet (-ACDEFGHIKLMNPQRSTVWY) can be overridden by the option -a ALPHABET. All uppercase letters and non-letter ASCII characters are acceptable. Lowercase letters in the alignment file will be mapped to the corresponding uppercase letter, except for in focus mode -f, in which columns that are lower case in the focus sequence will be discarded.

Physical systems. As an example, simulated draws from a 3-state, 1-dimensional Potts model are provided in the examples folder and encoded by the characters _, *, and ^. The following command would estimate the parameters by running to convergence with λe = 1.0, λh = 1.0 and sequence reweighting disabled:

bin/plmc -c example/potts/potts3.txt -a _*^ -t -1 -le 1.0 -lh 1.0 example/potts/potts3.a2m

A 1D Potts model will only have interactions between i -> i + 1, which should be evident in the coupling summary scores output to example/potts/potts3.txt

References

The use of pseudolikelihood approximations for approximate inference of spin models for sequence families was established in the compsci/bioinformatics literature by

Balakrishnan, S., Kamisetty, H., Carbonell, J. G., Lee, S. I., & Langmead, C. J. (2011). Learning generative models for protein fold families. Proteins: Structure, Function, and Bioinformatics, 79(4), 1061-1078.

and in the biophysics literature by

Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M., & Aurell, E. (2013). Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Physical Review E, 87(1), 012707.

plmc implements a joint optimization version of inference described in these and subsequent works (i.e. the 'symmetric' pseudolikelihood). If you'd like to use plmc in your own work, please cite the following paper. Also, please let us know if you have any comments or questions!

Hopf, T. A., Ingraham, J. B., Poelwijk, F. J., Schärfe, C. P., Springer, M., Sander, C., & Marks, D. S. (2017). Mutation effects predicted from sequence co-variation. Nature Biotechnology, 35(2), 128-135.

Author

plmc was written by John Ingraham in Debora Marks' lab at Harvard Medical School

Credits

This code uses a C implementation of L-BFGS by Naoaki Okazaki and Nishimura and Matsumoto's Mersenne Twister in C, which are included in this repo.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].