EBI-Metagenomics / emg-viral-pipeline

Licence: Apache-2.0
VIRify: detection of phages and eukaryotic viruses from metagenomic and metatranscriptomic assemblies

Programming languages: Python, Nextflow, Shell, Dockerfile, R, Ruby

  1. VIRify pipeline
  2. CWL execution
  3. Nextflow execution

VIRify

VIRify is a recently developed pipeline for the detection, annotation, and taxonomic classification of viral contigs in metagenomic and metatranscriptomic assemblies. The pipeline is part of the repertoire of analysis services offered by MGnify. VIRify's taxonomic classification relies on the detection of taxon-specific profile hidden Markov models (HMMs), built upon a set of 22,014 orthologous protein domains and referred to as ViPhOGs.

The pipeline is implemented and available in CWL and Nextflow.

Common Workflow Language

VIRify was implemented in Common Workflow Language (CWL).

What do I need?

The current implementation uses CWL version 1.2. It was tested using Toil version 5.3.0 as the workflow engine and conda to manage the software dependencies.

How?

For instructions, see the CWL README.

Nextflow

A Nextflow implementation of the VIRify pipeline. In the backend, the same scripts are used as in the CWL implementation.

What do I need?

This pipeline runs with the workflow manager Nextflow and needs either Docker or Singularity as a second dependency. Conda support is planned but not yet available, and we highly recommend using the stable containers in any case. All other programs and databases are downloaded automatically by Nextflow. Note that the workflow downloads roughly 19 GB of databases (49 GB with --hmmextend and --blastextend) the first time it is executed.
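
Since the first run downloads the databases, it can help to check the free disk space beforehand. The following is a minimal sketch and not part of VIRify: the helper name has_free_space and the DB_DIR variable are assumptions made for illustration, and the 19 GB figure is the default database size quoted above.

```shell
#!/usr/bin/env bash
# Sketch: check that a directory's filesystem has enough free space
# before the first VIRify run downloads its databases.
# has_free_space is a hypothetical helper, not part of the pipeline.

has_free_space() {
  local dir="$1" required_gb="$2"
  local avail_kb
  # df -Pk prints POSIX-format output; line 2, column 4 is the available space in 1K blocks
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $(( required_gb * 1024 * 1024 )) ]
}

# Directory you intend to pass to --databases (assumed; defaults to the current directory)
DB_DIR="${DB_DIR:-.}"

if has_free_space "$DB_DIR" 19; then
  echo "Enough space for the default databases in $DB_DIR"
else
  echo "Less than 19 GB free in $DB_DIR" >&2
fi
```

Use 49 instead of 19 if you plan to run with --hmmextend and --blastextend.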

Install Nextflow

curl -s https://get.nextflow.io | bash

Install Docker

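If you don't have experience with bioinformatics tools and their installation, just copy the commands into your terminal to set everything up:

# requires Docker's apt repository to be configured already
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
sudo usermod -a -G docker $USER
# log out and back in for the group change to take effect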

Install Singularity

While Singularity can be installed via Conda, we recommend setting up a native Singularity installation. On HPC systems, ask your system administrator. Here is also a good manual to get you started. Please note: you only need Docker or Singularity, not both.

Basic execution

Simply clone this repository and execute virify.nf:

git clone https://github.com/EBI-Metagenomics/emg-viral-pipeline.git
cd emg-viral-pipeline
nextflow run virify.nf --help

Alternatively (recommended), let Nextflow handle the installation. The same command also updates the pipeline:

nextflow pull EBI-Metagenomics/emg-viral-pipeline

Get help:

nextflow run EBI-Metagenomics/emg-viral-pipeline --help

We highly recommend running stable releases, which also helps reproducibility:

nextflow run EBI-Metagenomics/emg-viral-pipeline -r v0.2.0 --help

Run annotation for a small assembly file (10 contigs, 0.78 Mbp) on your local machine using Docker containers (default: --cores 4; takes approximately 10 min on an 8-core i7 laptop, plus the time needed to download ~19 GB of databases):

nextflow run EBI-Metagenomics/emg-viral-pipeline -r v0.2.0 --fasta "/home/$USER/.nextflow/assets/EBI-Metagenomics/emg-viral-pipeline/nextflow/test/assembly.fasta" --cores 4 -profile local,docker

Please note that the following parameters are particularly important for controlling where Nextflow writes files:

  • --workdir or -w (where your work directories will be saved)
  • --databases (where your databases will be saved; the workflow checks whether they are already available)
  • --cachedir (where Singularity containers will be cached; not needed for Docker)

Execution specific for the EBI cluster:

source /hps/nobackup2/production/metagenomics/virus-pipeline/CONFIG

# recommended run example to easily resume a run later and to keep all run-related .nextflow.log files in the correct folder
OUTPUT=/path/to/output/dir
mkdir -p "$OUTPUT"
DIR=$PWD
cd "$OUTPUT"
# this will pull the pipeline if it is not already available
# use `nextflow pull EBI-Metagenomics/emg-viral-pipeline` to update the pipeline
nextflow run EBI-Metagenomics/emg-viral-pipeline -r v0.2.0 \
--fasta "/homes/$USER/.nextflow/assets/EBI-Metagenomics/emg-viral-pipeline/nextflow/test/assembly.fasta" \
--output "$OUTPUT" --workdir "$OUTPUT/work" --databases "$DATABASES" \
--cachedir "$SINGULARITY" -profile ebi
cd "$DIR"

Profiles

Nextflow uses a merged profile handling system, so you have to define an executor (e.g. local, lsf, slurm) and an engine (docker, singularity) to run the pipeline according to your needs and infrastructure.

By default, the workflow runs locally (e.g. on your laptop) with Docker. When you execute the workflow on an HPC, you can for example switch to a specific job scheduler and to Singularity instead of Docker:

  • SLURM (-profile slurm,singularity)
  • LSF (-profile lsf,singularity)

Don't forget, especially on an HPC, to define further important parameters such as

  • --workdir or -w (where your work directories will be saved)
  • --databases (where your databases will be saved; the workflow checks whether they are already available)
  • --cachedir (where Singularity containers will be cached)
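
The profile and path parameters can be combined into a single command line. The sketch below only prints the command so that it can be reviewed before running. The helper name virify_cmd, the directory layout under the base path, and the input file name are assumptions for illustration. The accepted executors and engines are just the ones named in this README.

```shell
#!/usr/bin/env bash
# Sketch: assemble a VIRify command line with an explicit profile and
# explicit output locations. virify_cmd is a hypothetical helper and
# all paths are placeholders.

virify_cmd() {
  local fasta="$1" executor="$2" engine="$3" base="$4"
  local cmd
  case "$executor" in
    local|lsf|slurm) ;;   # executors named in this README
    *) echo "unknown executor: $executor" >&2; return 1 ;;
  esac
  case "$engine" in
    docker|singularity) ;;   # engines named in this README
    *) echo "unknown engine: $engine" >&2; return 1 ;;
  esac
  cmd="nextflow run EBI-Metagenomics/emg-viral-pipeline -r v0.2.0"
  cmd="$cmd --fasta $fasta --workdir $base/work --databases $base/databases"
  # the container cache is only needed for Singularity
  if [ "$engine" = singularity ]; then
    cmd="$cmd --cachedir $base/singularity"
  fi
  # print the command instead of running it
  echo "$cmd -profile $executor,$engine"
}

virify_cmd assembly.fasta slurm singularity /path/to/scratch
```

Printing rather than executing keeps the sketch free of side effects. Once the output looks right, the printed line can be pasted into the terminal.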

The conda engine does not work at the moment because there is no conda recipe for PPR-Meta yet. Please use Docker instead, or install PPR-Meta yourself and then use the conda profile.

DAG chart

[Figure: DAG chart]

A note about metatranscriptomes

Although VIRify has been benchmarked and validated with metagenomic data in mind, it is also possible to use this tool to detect RNA viruses in metatranscriptome assemblies (e.g. SARS-CoV-2). However, some additional considerations for this purpose are outlined below:

1. Quality control: As for metagenomic data, a thorough quality control of the FASTQ sequence reads to remove low-quality bases, adapters and host contamination (if appropriate) is required prior to assembly. This is especially important for metatranscriptomes as small errors can further decrease the quality and contiguity of the assembly obtained. We have used TrimGalore for this purpose.

2. Assembly: There are many assemblers available that are appropriate for either metagenomic or single-species transcriptomic data. However, to our knowledge, there is no assembler currently available specifically for metatranscriptomic data. From our preliminary investigations, we have found that transcriptome-specific assemblers (e.g. rnaSPAdes) generate more contiguous and complete metatranscriptome assemblies compared to metagenomic alternatives (e.g. MEGAHIT and metaSPAdes).

3. Post-processing: Metatranscriptomes generate highly fragmented assemblies. Therefore, filtering contigs based on a set minimum length has a substantial impact on the number of contigs processed in VIRify. It has also been observed that the number of false-positive detections of VirFinder (one of the tools included in VIRify) is lower among longer contigs. The choice of a length threshold will depend on the complexity of the sample and the sequencing technology used, but in our experience any contigs <2 kb should be analysed with caution.

4. Classification: The classification module of VIRify depends on the presence of a minimum number and proportion of phylogenetically-informative genes within each contig in order to confidently assign a taxonomic lineage. Therefore, short contigs typically obtained from metatranscriptome assemblies remain generally unclassified. For targeted classification of RNA viruses (for instance, to search for Coronavirus-related sequences), alternative DNA- or protein-based classification methods can be used. Two of the possible options are: (i) using MashMap to screen the VIRify contigs against a database of RNA viruses (e.g. Coronaviridae) or (ii) using hmmsearch to screen the proteins obtained in the VIRify contigs against marker genes of the taxon of interest.
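
Points 3 and 4 can be sketched as two small shell helpers. This is an illustration under stated assumptions, not part of VIRify: the function names filter_contigs and tblout_hits, all file names, and the 2000 bp and 1e-5 cut-offs are hypothetical. The second helper assumes the standard hmmsearch --tblout layout, where column 1 is the target (protein) name and column 5 is the full-sequence E-value.

```shell
#!/usr/bin/env bash
# Sketch for points 3 and 4. Both helpers are hypothetical and not part of VIRify.

# Keep only contigs of at least min_len bases (handles multi-line FASTA records).
filter_contigs() {
  local fasta="$1" min_len="$2"
  awk -v min="$min_len" '
    function emit() { if (header != "" && length(seq) >= min) print header "\n" seq }
    /^>/ { emit(); header = $0; seq = ""; next }
         { seq = seq $0 }
    END  { emit() }
  ' "$fasta"
}

# List proteins whose full-sequence E-value is at or below max_evalue in an
# hmmsearch --tblout table (column 1: target name, column 5: E-value).
tblout_hits() {
  local tblout="$1" max_evalue="$2"
  awk -v max="$max_evalue" \
    '!/^#/ && NF >= 5 && ($5 + 0) <= (max + 0) { print $1 }' "$tblout" | sort -u
}

# Example usage (file names and cut-offs are placeholders):
#   filter_contigs assembly.fasta 2000 > assembly.min2kb.fasta
#   hmmsearch --tblout hits.tbl markers.hmm proteins.faa > /dev/null
#   tblout_hits hits.tbl 1e-5
```

The length filter is meant to be applied to the assembly before running VIRify. The hit list is a starting point for the targeted screening described in point 4, and any threshold should be chosen for the sample and taxon at hand.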

Cite

If you use VIRify in your work, please cite:

TBA
