smaegol / PlasFlow

Licence: GPL-3.0 license
Software for prediction of plasmid sequences in metagenomic assemblies

NOT MAINTAINED

Use at your own risk. I am very grateful that it is being widely used, but as I have completely changed my research area, I cannot devote time to maintaining this project. Other, newer packages have been developed and can be used instead.

PlasFlow 1.1

PlasFlow is a set of scripts for predicting plasmid sequences in metagenomic contigs. It relies on neural network models trained on full genome and plasmid sequences and is able to differentiate between plasmids and chromosomes with accuracy reaching 96%. It outperforms other available solutions for plasmid recovery from metagenomes and incorporates thresholding, which allows uncertain predictions to be excluded. PlasFlow has been published in Nucleic Acids Research (https://doi.org/10.1093/nar/gkx1321).

News

2018-05-25 Version 1.1 released

A new version (1.1) has been released, which is better suited for large datasets. It can be downloaded from conda and PyPI, but the simplest way to upgrade is to replace the PlasFlow.py file in your previous installation with the current one. If you still encounter problems with the new version, try using smaller values for the --batch_size option.

Requirements:

  • Python 3.5

  • Python packages:

    • Scikit-learn 0.18.1
    • Numpy
    • Pandas
    • TensorFlow 0.10.0
    • rpy2 >= 2.8
    • scipy
    • biopython
    • dateutil >= 2.5
  • R 3.25

  • R packages:

    • Biostrings

For the Perl scripts, especially filter_sequences_by_length.pl:

  • Perl modules:

    • BioPerl
    • Getopt::Long

Installation

Conda-based - recommended

Conda is the recommended installation option, as it properly resolves all dependencies (including R and Biostrings) and installs them without interfering with other installed packages. Conda is available through both Anaconda and Miniconda (the latter is easier to install and maintain).

After installing conda, add the bioconda channel, which is required for installation of the Biostrings package:

conda config --add channels bioconda

Sometimes it may also be necessary to add the conda-forge channel:

conda config --add channels conda-forge

To avoid dependency conflicts, it is encouraged to create a separate conda environment for PlasFlow using the command:

conda create --name plasflow python=3.5

Python 3.5 is required because of TensorFlow requirements.

To activate the created environment, type:

source activate plasflow

Mac users should install TensorFlow at this step (as the osx-64 package is not present in the default channels). If you encounter problems with a missing TensorFlow dependency on other platforms, also try installing TensorFlow from this source:

conda install -c jjhelmus tensorflow=0.10.0rc0

PlasFlow can be easily installed as an Anaconda package from my Anaconda channel using:

conda install plasflow -c smaegol

With this command all required dependencies are installed into the created conda environment. When the installation is finished, PlasFlow can be invoked as described in the Getting started section.

When you finish working with PlasFlow, you can deactivate the current conda environment with the command:

source deactivate

Pip installer

Installation with pip is also possible. However, some requirements have to be met:

  1. Python 3.5 is required (due to TensorFlow requirements)
  2. TensorFlow has to be installed manually:
pip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0rc0-cp35-cp35m-linux_x86_64.whl

then install PlasFlow with

pip install plasflow

However, models used for prediction have to be downloaded separately (for example using git clone https://github.com/smaegol/PlasFlow).

Manual installation

Alternatively, the PlasFlow repository can be cloned using

git clone https://github.com/smaegol/PlasFlow

but in that case all dependencies have to be installed manually. TensorFlow can be installed as specified above:

pip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0rc0-cp35-cp35m-linux_x86_64.whl

Python dependencies can be installed using pip:

pip install numpy pandas scipy rpy2 scikit-learn biopython

To install the R package Biostrings, go to https://bioconductor.org/packages/release/bioc/html/Biostrings.html and follow the instructions there.

Perl modules for additional scripts

Perl scripts (like filter_sequences_by_length.pl) included with PlasFlow require a few Perl modules. They can be easily installed using conda:

conda install -c bioconda perl-bioperl perl-getopt-long

or cpan:

cpan -i Bio::Perl Getopt::Long

or any package manager available on your system (apt, brew).

Getting started

PlasFlow is designed to take a metagenomic assembly and identify contigs which may come from plasmids. It outputs several files, of which the most important is a tabular file containing all predictions (specified with the --output option).

Before running PlasFlow, it is highly recommended to filter sequences by length, keeping only those longer than 1000 bp. PlasFlow, like other k-mer-based methods, does not perform well on short sequences, because it is hard to get proper k-mer coverage from them. Hence, results for short sequences are unreliable. As metagenomic assemblies usually contain a large number of short contigs, this additional filtering step can improve results, speed up PlasFlow, and prevent excessive RAM usage.
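To make the coverage argument concrete, the sketch below counts overlapping k-mers in a sequence and reports the fraction of all possible k-mers that were observed. It is only an illustration: the function names kmer_counts and kmer_coverage and the choice of k are hypothetical, and this is not the code PlasFlow itself uses for k-mer frequency calculation.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count overlapping k-mers that consist only of A, C, G and T."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Skip windows containing ambiguous bases such as N.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_coverage(sequence, k):
    """Fraction of the 4**k possible k-mers seen at least once."""
    return len(kmer_counts(sequence, k)) / 4 ** k
```

A contig of length L contains at most L - k + 1 overlapping k-mers, so for k = 5 (1024 possible k-mers) a 300 bp contig can cover at most 296 of them, and most of those will be observed only once. Frequencies estimated from so few observations are noisy, which is why short contigs give unreliable predictions.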

To filter sequences using the provided Perl script, type:

filter_sequences_by_length.pl -input input_dataset.fasta -output filtered_output.fasta -thresh sequence_length_threshold

where the sequence length threshold has to be provided in base pairs. The filtered fasta file can then be used directly for PlasFlow prediction.
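If Perl or its modules are not available, the same kind of filtering can be sketched in plain Python. This is a minimal stand-in, not a port of filter_sequences_by_length.pl: the function names are hypothetical, sequences are written on a single line, and records are kept when their length is greater than or equal to the threshold, which is an assumption that may differ from the Perl script's behaviour at the boundary.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(in_path, out_path, min_length=1000):
    """Write records of at least min_length bp and return how many were kept."""
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in read_fasta(src):
            # Assumption: inclusive threshold (>=).
            if len(seq) >= min_length:
                dst.write(">{}\n{}\n".format(header, seq))
                kept += 1
    return kept
```

For example, filter_by_length("input_dataset.fasta", "filtered_output.fasta", 1000) would write the retained contigs to filtered_output.fasta and return their count.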

Options available in PlasFlow include:

  • --input - specifies input fasta file with assembly contigs to classify [required]
  • --output - a name of the tsv file with the tabular output of classification [required]
  • --threshold - manually specified threshold for probability filtering (default = 0.7)
  • --labels - manually specified custom location of labels file (used for translation from numeric output to actual class names)
  • --models - custom location of models used for prediction (has to be specified if PlasFlow was installed using pip)
  • --batch_size - how many sequences can be used in a single batch of k-mer frequency calculation

Output

The most important output of PlasFlow is a tabular file containing all predictions (specified with the --output option), consisting of several columns including:

contig_id contig_name contig_length id label ...

where:

  • contig_id is an internal id of the sequence used for the classification
  • contig_name is a name of contig used in the classification
  • contig_length shows the length of a classified sequence
  • id is an internal id of a produced label (classification)
  • label is the actual classification
  • ... represents additional columns showing probabilities of assignment to each possible class

Sequences can be classified to 26 classes including: chromosome.Acidobacteria, chromosome.Actinobacteria, chromosome.Bacteroidetes, chromosome.Chlamydiae, chromosome.Chlorobi, chromosome.Chloroflexi, chromosome.Cyanobacteria, chromosome.DeinococcusThermus, chromosome.Firmicutes, chromosome.Fusobacteria, chromosome.Nitrospirae, chromosome.other, chromosome.Planctomycetes, chromosome.Proteobacteria, chromosome.Spirochaetes, chromosome.Tenericutes, chromosome.Thermotogae, chromosome.Verrucomicrobia, plasmid.Actinobacteria, plasmid.Bacteroidetes, plasmid.Chlamydiae, plasmid.Cyanobacteria, plasmid.DeinococcusThermus, plasmid.Firmicutes, plasmid.Fusobacteria, plasmid.other, plasmid.Proteobacteria, plasmid.Spirochaetes.

If the probability of assignment to a given class is lower than the threshold (default = 0.7), the sequence is treated as unclassified.
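The thresholding rule can be sketched as follows. This is an illustration of the rule as described here, not PlasFlow's own code: the function name classify and the dictionary input are hypothetical, and the assumption is that the class with the highest probability is the one compared against the threshold.

```python
def classify(probabilities, threshold=0.7):
    """Return (label, probability) for the most probable class.

    The label is replaced by "unclassified" when the highest
    probability is lower than the threshold.
    """
    label = max(probabilities, key=probabilities.get)
    best = probabilities[label]
    if best < threshold:
        return "unclassified", best
    return label, best
```

With the default threshold, a contig whose best class is plasmid.Proteobacteria at probability 0.55 is reported as unclassified, while the same class at probability 0.9 is kept. Raising the threshold trades fewer classified contigs for more confident assignments.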

Additionally, PlasFlow produces fasta files containing the input sequences binned into plasmids, chromosomes and unclassified.
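The same binning can be recovered from the tabular output alone. The sketch below groups contig names by the part of the label before the first dot. It assumes that the file is tab-separated, that it has a header row with the column names contig_name and label, and that labels follow the class naming shown above; the function name bin_contigs is hypothetical.

```python
import csv
from collections import defaultdict


def bin_contigs(tsv_lines):
    """Group contig names by the label prefix (text before the first dot)."""
    bins = defaultdict(list)
    for row in csv.DictReader(tsv_lines, delimiter="\t"):
        bin_name = row["label"].split(".", 1)[0]
        bins[bin_name].append(row["contig_name"])
    return dict(bins)
```

The function accepts any iterable of lines, so an open file handle for the tsv file given with --output can be passed directly.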

Test dataset

The test dataset is located in the test folder (file Citrobacter_freundii_strain_CAV1321_scaffolds.fasta). It is the SPAdes 3.9.1 assembly of the Citrobacter freundii strain CAV1321 genome (NCBI assembly ID: GCA_001022155.1), which contains 1 chromosome and 9 plasmids. The same folder contains the classification results in the form of a tsv file (Citrobacter_freundii_strain_CAV1321_scaffolds.fasta.PlasFlow.tsv) and fasta files containing the identified bins (Citrobacter_freundii_strain_CAV1321_scaffolds.fasta.PlasFlow.tsv_chromosomes.fasta, Citrobacter_freundii_strain_CAV1321_scaffolds.fasta.PlasFlow.tsv_plasmids.fasta and Citrobacter_freundii_strain_CAV1321_scaffolds.fasta.PlasFlow.tsv_unclassified.fasta).

To invoke PlasFlow on the test dataset, copy the test/Citrobacter_freundii_strain_CAV1321_scaffolds.fasta file to your current working directory and type:

PlasFlow.py --input Citrobacter_freundii_strain_CAV1321_scaffolds.fasta --output test.plasflow_predictions.tsv --threshold 0.7

The predictions will be located in the test.plasflow_predictions.tsv file and can be compared to the results available in test/Citrobacter_freundii_strain_CAV1321_scaffolds.fasta.PlasFlow.tsv.
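One way to make that comparison is to load the label assigned to each contig from both files and list the contigs where they differ. This is a minimal sketch with hypothetical function names, assuming both files are tab-separated with a header row containing the contig_name and label columns.

```python
import csv


def load_labels(tsv_lines):
    """Map each contig name to its predicted label."""
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    return {row["contig_name"]: row["label"] for row in reader}


def differing_labels(new, reference):
    """Return {contig: (new_label, reference_label)} where the two disagree.

    A contig missing from one of the tables is reported with None
    on that side.
    """
    contigs = sorted(set(new) | set(reference))
    return {
        contig: (new.get(contig), reference.get(contig))
        for contig in contigs
        if new.get(contig) != reference.get(contig)
    }
```

An empty result means that the two tables assign the same label to every contig. Small differences in the probability columns are not checked by this sketch.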

Detailed information

Detailed information concerning the algorithm and the assumptions on which PlasFlow is based can be found in the publication "PlasFlow - Predicting Plasmid Sequences in Metagenomic Data Using Genome Signatures" (Nucleic Acids Research, 2018, https://doi.org/10.1093/nar/gkx1321). The flowchart illustrating the major steps of training and prediction is shown below:

PlasFlow Flowchart

All models tested and described in the manuscript can be found in a separate repository: https://github.com/smaegol/PlasFlow_models

Scripts used for the preparation of the training dataset and for neural network training are available in the scripts subfolder as well as in a separate repository: https://github.com/smaegol/PlasFlow_processing

Citation

Please cite the following paper when using PlasFlow for your own research.

Krawczyk PS, Lipinski L, Dziembowski A. PlasFlow: predicting plasmid sequences in metagenomic data using genome signatures. Nucleic Acids Res. 2018 Apr 6;46(6):e35. doi: 10.1093/nar/gkx1321.

TBD

In the next releases we plan to retrain the models using the most recent TensorFlow release. During the development of PlasFlow there were many changes in the TensorFlow library, and the newest version is not compatible with the models trained for the older TensorFlow version. However, retraining requires significant computational effort and recoding. As we want to include Archaea sequences (which are currently missing) in the models, we plan to train new models with the latest TensorFlow version and release a new version of PlasFlow in the second half of 2018.

Support

Any issues connected with PlasFlow should be addressed to Pawel Krawczyk (p.krawczyk (at) ibb.waw.pl).
