MGnify genome analysis pipeline

MGnify CWL pipeline to characterize a set of isolate or metagenome-assembled genomes (MAGs) using the workflow described in the following publication:

A Almeida, S Nayfach, M Boland, F Strozzi, M Beracochea, ZJ Shi, KS Pollard, E Sakharova, DH Parks, P Hugenholtz, N Segata, NC Kyrpides and RD Finn. (2020) A unified catalog of 204,938 reference genomes from the human gut microbiome. Nature Biotechnol. doi: https://doi.org/10.1038/s41587-020-0603-3

Clone repo

git clone https://github.com/EBI-Metagenomics/genomes-pipeline.git
cd genomes-pipeline

Installation with Docker

Install the required tools (preferably in a separate environment):

cwltool (tested v1.0.2) or toil
Docker or Singularity
conda

Add python scripts to PATH

export PATH=${PATH}:docker/python3_scripts:docker/genomes-catalog-update/scripts

All Docker images have been pushed to DockerHub. To rebuild the images yourself:

cd docker
bash build.sh

Installation without Docker

Install the necessary dependencies:

cwltool (tested v1.0.2) or toil
R (tested v3.5.2). Packages: reshape2, fastcluster, optparse, data.table and ape.
Python v3.6+
Perl
CheckM (tested v1.0.11)
CAT (tested v5.0)
cmsearch
dRep (tested v2.2.4)
eggNOG-mapper (tested v2.0)
GTDB-Tk (tested v0.3.1 and v1.0.2)
GUNC
InterProScan (tested v5.35-74.0 and v5.38-76.0)
MMseqs2 (tested v8-fac81)
Panaroo
Prokka (tested v1.14.0)
samtools
tRNAscan-SE
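
Before running the pipeline without Docker, it can help to confirm that the dependencies above are reachable from your shell. The following is a minimal sketch, not part of the repository: `check_tools` is a hypothetical helper, and the executable names in the usage comment are assumptions based on how each tool is usually installed, so adjust them to your environment.

```shell
# Report which of the given commands are missing from PATH.
# Prints one missing name per line and returns non-zero if any are missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (executable names are assumptions, adjust as needed):
# check_tools cwltool Rscript python3 perl checkm CAT cmsearch dRep \
#     emapper.py gtdbtk gunc interproscan.sh mmseqs panaroo prokka \
#     samtools tRNAscan-SE || echo "Install the tools listed above first"
```

The helper only checks that a command exists; it does not verify the tested versions listed above.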

Add the custom scripts to your $PATH environment variable.

export PATH=${PATH}:docker/genomes-catalog-update/scripts
export PATH=${PATH}:docker/python3_scripts
export PATH=${PATH}:docker/bash
export PATH=${PATH}:docker/detect_rRNA
export PATH=${PATH}:docker/GUNC
export PATH=${PATH}:docker/mash2nwk
export PATH=${PATH}:docker/mmseqs
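
The entries above are relative paths, so they only resolve while your working directory is the repository root. A possible alternative, sketched here under the assumption that the script directories sit where the exports above say they do, is to add them as absolute paths. `add_pipeline_paths` is a hypothetical helper, not something shipped with the pipeline.

```shell
# Append the pipeline's script directories to PATH as absolute paths.
# Argument: the directory where genomes-pipeline was cloned (defaults to ".").
add_pipeline_paths() {
    local repo_root sub
    repo_root="$(cd "${1:-.}" && pwd)" || return 1
    for sub in \
        docker/genomes-catalog-update/scripts \
        docker/python3_scripts \
        docker/bash \
        docker/detect_rRNA \
        docker/GUNC \
        docker/mash2nwk \
        docker/mmseqs
    do
        if [ -d "${repo_root}/${sub}" ]; then
            PATH="${PATH}:${repo_root}/${sub}"
        else
            # Directory layout may differ between versions; warn and move on.
            echo "warning: ${repo_root}/${sub} not found, skipped" >&2
        fi
    done
    export PATH
}

# Usage: add_pipeline_paths /path/to/genomes-pipeline
```

Define the function in the current shell (for example by sourcing a file that contains it) so that the updated PATH persists after the call.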

Download databases

bash download_db.sh

Run

Note: You can manually change the MMseqs2 protein-clustering parameters in your YML file (arguments mmseqs_limit_i, mmseq_limit_annotation, mmseqs_limit_c).

Download your data to a directory (GENOMES) beforehand and make sure that none of the genomes are compressed.
Create the YML file with our helper script:

export GENOMES=


python3 installation/create_yml.py \
        -d ${GENOMES} ...
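
Since the genomes must be uncompressed, a quick check of the GENOMES directory before running the helper script can catch problems early. This is a minimal sketch: `find_compressed_genomes` is a hypothetical helper, and the list of extensions it looks for is an assumption.

```shell
# List files under a genomes directory that look compressed.
# The extensions checked here (.gz, .bz2, .xz, .zip) are an assumption.
find_compressed_genomes() {
    local dir="$1"
    find "$dir" -type f \
        \( -name '*.gz' -o -name '*.bz2' -o -name '*.xz' -o -name '*.zip' \)
}

# Usage: print any compressed files, then decompress gzip files in place.
# find_compressed_genomes "${GENOMES}"
# find "${GENOMES}" -type f -name '*.gz' -exec gunzip {} +
```

An empty result means no file with one of those extensions was found; it does not prove that every file is a valid FASTA.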

Pipeline structure

Output files/folders:

  MGYG...NUM
         --- genome
              --- fa
              --- fa.fai
              --- faa (main rep)
              --- gff (main rep)
         --- pan-genome
              --- core_genes.txt
              --- <cluster>_mashtree.nwk
              --- pan_genome_reference.fa
              --- gene_presence_absence.Rtab
  MGYG...NUM
         --- genome
              --- fa
              --- fa.fai
              --- gff
              --- faa
  mmseqs_cluster_rep.emapper.annotations 
  mmseqs_cluster_rep.emapper.seed_orthologs
  mmseqs_cluster_rep.IPS.tsv

  intermediate_files/
         --- clusters_split.txt
         --- drep-filt-list.txt
         --- extra_weight_table.txt
         --- gunc_report_completed.txt
         --- names.tsv
         --- renamed_download.csv
         --- Sdb.csv
         --- mmseq.tsv
  gtdb-tk_output/ (step currently commented out)
  rRNA_fastas/
  rRNA_outs/
  GFFs/
  mmseqs_output/
        mmseqs_0.5_outdir.tar.gz
        mmseqs_0.95_outdir.tar.gz
        mmseqs_0.9_outdir.tar.gz
        mmseqs_1.0_outdir.tar.gz
  panaroo_output/
        MGYG.._panaroo.tar.gz
        ...
  per-genome-annotations/ (for post-processing)
  drep_genomes/                   (for GTDB-Tk)

Tool description

CheckM: Estimate genome completeness and contamination.
GTDB-Tk: Genome taxonomic assignment using the GTDB framework.
dRep: Genome de-replication.
Mash2Nwk: Generate Mash distance tree of conspecific genomes.
Prokka: Predict protein-coding sequences from genome assembly.
MMseqs2: Cluster protein-coding sequences.
InterProScan: Protein functional annotation using the InterPro database.
eggNOG-mapper: Protein functional annotation using the eggNOG database.
