MGnify genome analysis pipeline
MGnify CWL pipeline to characterize a set of isolate or metagenome-assembled genomes (MAGs) using the workflow described in the following publication:
A Almeida, S Nayfach, M Boland, F Strozzi, M Beracochea, ZJ Shi, KS Pollard, E Sakharova, DH Parks, P Hugenholtz, N Segata, NC Kyrpides and RD Finn. (2020) A unified catalog of 204,938 reference genomes from the human gut microbiome. Nature Biotechnol. doi: https://doi.org/10.1038/s41587-020-0603-3
Clone repo
git clone https://github.com/EBI-Metagenomics/genomes-pipeline.git
cd genomes-pipeline
Installation with Docker
- Install the necessary tools (preferably in a separate environment):
- cwltool (tested v1.0.2) or toil
- Docker or Singularity
- conda
- Add the Python scripts to PATH:
export PATH=${PATH}:docker/python3_scripts:docker/genomes-catalog-update/scripts
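Before going further, it can help to confirm that the tools listed above are actually reachable from your shell. The following is a minimal sketch, not part of the repository: `check_tools` is a hypothetical helper name, and the command names passed to it (`cwltool`, `docker`, `conda`) are assumptions about how the tools are invoked on your system.

```shell
# Hypothetical helper (not shipped with the pipeline).
# Prints the name of every given command that cannot be found on PATH,
# one per line, and returns 1 if at least one is missing.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Assumed command names; swap in toil-cwl-runner or singularity if you use those.
check_tools cwltool docker conda || echo "Some required tools are missing (listed above)." >&2
```

The function only tests that a command exists; it does not check versions.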
All Docker images are available on DockerHub. If you want to rebuild them:
cd docker
bash build.sh
Installation without Docker
- Install the necessary dependencies:
- cwltool (tested v1.0.2) or toil
- R (tested v3.5.2). Packages: reshape2, fastcluster, optparse, data.table and ape.
- Python v3.6+
- Perl
- CheckM (tested v1.0.11)
- CAT (tested v5.0)
- cmsearch
- dRep (tested v2.2.4)
- eggNOG-mapper (tested v2.0)
- GTDB-Tk (tested v0.3.1 and v1.0.2)
- GUNC
- InterProScan (tested v5.35-74.0 and v5.38-76.0)
- MMseqs2 (tested v8-fac81)
- Panaroo
- Prokka (tested v1.14.0)
- samtools
- tRNAscan-SE
- Add custom scripts to your $PATH environment.
export PATH=${PATH}:docker/genomes-catalog-update/scripts
export PATH=${PATH}:docker/python3_scripts
export PATH=${PATH}:docker/bash
export PATH=${PATH}:docker/detect_rRNA
export PATH=${PATH}:docker/GUNC
export PATH=${PATH}:docker/mash2nwk
export PATH=${PATH}:docker/mmseqs
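The entries above are relative paths, so the shell resolves them only while the current directory is the repository root. One hedged alternative is to add absolute paths instead. The sketch below assumes you run it from the clone directory; `add_repo_dirs_to_path` is a hypothetical helper name, not something provided by the pipeline.

```shell
# Hypothetical helper (not shipped with the pipeline).
# Appends each sub-directory of a base directory to PATH as an absolute path,
# skipping entries that are already present.
add_repo_dirs_to_path() {
    local base="$1"
    shift
    local sub dir
    for sub in "$@"; do
        dir="$base/$sub"
        case ":$PATH:" in
            *":$dir:"*) ;;              # already on PATH, nothing to do
            *) PATH="$PATH:$dir" ;;
        esac
    done
    export PATH
}

# Assumes the current directory is the genomes-pipeline clone.
add_repo_dirs_to_path "$(pwd)" \
    docker/genomes-catalog-update/scripts \
    docker/python3_scripts \
    docker/bash \
    docker/detect_rRNA \
    docker/GUNC \
    docker/mash2nwk \
    docker/mmseqs
```

Because the helper skips directories already on PATH, it can be sourced repeatedly without growing the variable.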
Download databases
bash download_db.sh
Run
Note: You can manually change the MMseqs2 protein-clustering parameters in your YML file (arguments mmseqs_limit_i, mmseq_limit_annotation and mmseqs_limit_c).
- Pre-download your data to a directory (GENOMES) and make sure that none of the genomes are compressed.
- Create the YML file with our helper script:
export GENOMES=
python3 installation/create_yml.py \
-d ${GENOMES} ...
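Since the pipeline expects uncompressed genomes, a quick check of the input directory may save a failed run. The sketch below is an illustration only: `find_compressed` is a hypothetical helper, and the list of extensions it looks for (`.gz`, `.bz2`, `.xz`, `.zip`) is an assumption about how compressed files are typically named, not something defined by the pipeline.

```shell
# Hypothetical helper (not shipped with the pipeline).
# Lists files under a directory whose extension suggests compression.
# Detection is by file name only; file contents are not inspected.
find_compressed() {
    find "$1" -type f \( -name '*.gz' -o -name '*.bz2' -o -name '*.xz' -o -name '*.zip' \)
}

# Only runs when GENOMES is already set to your genomes directory.
if [ -n "${GENOMES:-}" ]; then
    compressed="$(find_compressed "${GENOMES}")"
    if [ -n "$compressed" ]; then
        echo "Decompress these files before creating the YML file:" >&2
        echo "$compressed" >&2
    fi
fi
```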
Pipeline structure
Output files/folders:
MGYG...NUM
    --- genome
        --- fa
        --- fa.fai
        --- faa (main rep)
        --- gff (main rep)
    --- pan-genome
        --- core_genes.txt
        --- <cluster>_mashtree.nwk
        --- pan_genome_reference.fa
        --- gene_presence_absence.Rtab
MGYG...NUM
    --- genome
        --- fa
        --- fa.fai
        --- gff
        --- faa
mmseqs_cluster_rep.emapper.annotations
mmseqs_cluster_rep.emapper.seed_orthologs
mmseqs_cluster_rep.IPS.tsv
intermediate_files/
    --- clusters_split.txt
    --- drep-filt-list.txt
    --- extra_weight_table.txt
    --- gunc_report_completed.txt
    --- names.tsv
    --- renamed_download.csv
    --- Sdb.csv
    --- mmseq.tsv
gtdb-tk_output/ (currently commented out)
rRNA_fastas/
rRNA_outs/
GFFs/
mmseqs_output/
    mmseqs_0.5_outdir.tar.gz
    mmseqs_0.95_outdir.tar.gz
    mmseqs_0.9_outdir.tar.gz
    mmseqs_1.0_outdir.tar.gz
panaroo_output/
    MGYG.._panaroo.tar.gz
...
per-genome-annotations/ (for post-processing)
drep_genomes/ (for GTDB-Tk)
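After a run, the listing above can serve as a checklist. The following sketch reports which expected entries are absent from a results directory. It assumes that the folders named in the listing sit directly under that directory; `report_missing_outputs` and the `OUTDIR` variable are hypothetical names introduced here for illustration.

```shell
# Hypothetical helper (not shipped with the pipeline).
# $1 is the results directory; the remaining arguments are entry names
# expected directly beneath it. Prints each name that does not exist.
report_missing_outputs() {
    local outdir="$1"
    shift
    local name
    for name in "$@"; do
        [ -e "$outdir/$name" ] || echo "$name"
    done
}

# Only runs when OUTDIR (an assumed variable) points at your results directory.
if [ -n "${OUTDIR:-}" ]; then
    report_missing_outputs "${OUTDIR}" \
        intermediate_files rRNA_fastas rRNA_outs GFFs \
        mmseqs_output panaroo_output per-genome-annotations drep_genomes
fi
```

An empty result means every listed entry was found; it says nothing about whether the files inside are complete.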
Tool description
- CheckM: Estimate genome completeness and contamination.
- GTDB-Tk: Genome taxonomic assignment using the GTDB framework.
- dRep: Genome de-replication.
- Mash2Nwk: Generate Mash distance tree of conspecific genomes.
- Prokka: Predict protein-coding sequences from genome assembly.
- MMseqs2: Cluster protein-coding sequences.
- InterProScan: Protein functional annotation using the InterPro database.
- eggNOG-mapper: Protein functional annotation using the eggNOG database.