EBI-Metagenomics / genomes-pipeline

Licence: other
MGnify genome analysis pipeline

Programming languages: Python, Shell, Perl, Dockerfile, R

MGnify CWL pipeline to characterize a set of isolate or metagenome-assembled genomes (MAGs) using the workflow described in the following publication:

A Almeida, S Nayfach, M Boland, F Strozzi, M Beracochea, ZJ Shi, KS Pollard, E Sakharova, DH Parks, P Hugenholtz, N Segata, NC Kyrpides and RD Finn. (2020) A unified catalog of 204,938 reference genomes from the human gut microbiome. Nature Biotechnol. doi: https://doi.org/10.1038/s41587-020-0603-3

Clone repo

git clone https://github.com/EBI-Metagenomics/genomes-pipeline.git
cd genomes-pipeline

Installation with Docker

  1. Install all necessary tools (preferably in a separate environment):
  2. Add the Python scripts to your PATH:
export PATH=${PATH}:docker/python3_scripts:docker/genomes-catalog-update/scripts

All Docker images have been pushed to Docker Hub. To rebuild the images yourself:

cd docker
bash build.sh

Installation without Docker

  1. Install the necessary dependencies:
  2. Add the custom scripts to your $PATH:
export PATH=${PATH}:docker/genomes-catalog-update/scripts
export PATH=${PATH}:docker/python3_scripts
export PATH=${PATH}:docker/bash
export PATH=${PATH}:docker/detect_rRNA
export PATH=${PATH}:docker/GUNC
export PATH=${PATH}:docker/mash2nwk
export PATH=${PATH}:docker/mmseqs
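
The PATH entries above are relative, so they only resolve while the shell's working directory is the repository root. A minimal sketch of one way around that, assuming a hypothetical helper `export_lines` (not part of the repository) that prints the same entries as absolute paths:

```python
from pathlib import Path

# Script directories listed in this README, relative to the repository root.
SCRIPT_DIRS = [
    "docker/genomes-catalog-update/scripts",
    "docker/python3_scripts",
    "docker/bash",
    "docker/detect_rRNA",
    "docker/GUNC",
    "docker/mash2nwk",
    "docker/mmseqs",
]


def export_lines(repo_root, script_dirs=SCRIPT_DIRS):
    """Return one shell `export PATH=...` line per script directory.

    `repo_root` is wherever genomes-pipeline was cloned; the directories
    are made absolute so the exports work from any working directory.
    """
    root = Path(repo_root).resolve()
    return [f'export PATH="${{PATH}}:{root / d}"' for d in script_dirs]
```

Printing the returned lines, for example with `print("\n".join(export_lines(".")))` from inside the clone, gives a block that can be pasted into a shell profile.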

Download databases

bash download_db.sh

Run

Note: You can manually change the MMseqs2 protein-clustering parameters in your YML file (arguments mmseqs_limit_i, mmseq_limit_annotation, mmseqs_limit_c).
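
As an illustration of such a manual change, here is a minimal sketch that rewrites one top-level `key: value` line in the YML text. It assumes the argument is stored as a plain scalar on its own line; the helper `set_top_level_scalar` is hypothetical and the values used with it below are placeholders, not recommended settings.

```python
import re


def set_top_level_scalar(yml_text, key, value):
    """Replace the value of a top-level `key: value` line, or append it.

    Assumption: the key holds a plain scalar on a single line. Nested
    or multi-line values are not handled by this sketch.
    """
    pattern = re.compile(rf"^{re.escape(key)}:.*$", flags=re.MULTILINE)
    new_line = f"{key}: {value}"
    if pattern.search(yml_text):
        # A callable replacement avoids backslash escapes being interpreted.
        return pattern.sub(lambda _match: new_line, yml_text, count=1)
    separator = "" if (not yml_text or yml_text.endswith("\n")) else "\n"
    return f"{yml_text}{separator}{new_line}\n"
```

For example, `set_top_level_scalar(text, "mmseqs_limit_c", 2)` returns the YML text with that one line changed and every other line untouched.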

  1. Pre-download your data to a directory (GENOMES) and make sure that none of the genomes are compressed.
  2. Create the YML file with our helper script:
export GENOMES=


python3 installation/create_yml.py \
        -d ${GENOMES} ...
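
Because the genomes must not be compressed, it can help to check the GENOMES directory before generating the YML file. The following is a minimal sketch, not part of the pipeline: the helper `find_compressed` and its suffix list are assumptions, and it flags files either by a common compression suffix or by the gzip magic bytes.

```python
from pathlib import Path

# Assumption: these suffixes cover the compression formats of interest.
COMPRESSED_SUFFIXES = {".gz", ".bz2", ".xz", ".zip"}
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def find_compressed(genomes_dir):
    """Return the names of files in `genomes_dir` that look compressed."""
    hits = []
    for path in sorted(Path(genomes_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            head = handle.read(2)
        if path.suffix.lower() in COMPRESSED_SUFFIXES or head == GZIP_MAGIC:
            hits.append(path.name)
    return hits
```

An empty list means no file matched either check; any name returned should be decompressed before running `create_yml.py`.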

Pipeline structure

[Figure: Pipeline overview]

Output files/folders:

  MGYG...NUM
         --- genome
              --- fa
              --- fa.fai
              --- faa (main rep)
              --- gff (main rep)
         --- pan-genome
              --- core_genes.txt
              --- <cluster>_mashtree.nwk
              --- pan_genome_reference.fa
              --- gene_presence_absence.Rtab
  MGYG...NUM
         --- genome
              --- fa
              --- fa.fai
              --- gff
              --- faa
  mmseqs_cluster_rep.emapper.annotations
  mmseqs_cluster_rep.emapper.seed_orthologs
  mmseqs_cluster_rep.IPS.tsv

  intermediate_files/
         --- clusters_split.txt
         --- drep-filt-list.txt
         --- extra_weight_table.txt
         --- gunc_report_completed.txt
         --- names.tsv
         --- renamed_download.csv
         --- Sdb.csv
         --- mmseq.tsv
  gtdb-tk_output/ (currently commented out)
  rRNA_fastas/
  rRNA_outs/
  GFFs/
  mmseqs_output/
        mmseqs_0.5_outdir.tar.gz
        mmseqs_0.95_outdir.tar.gz
        mmseqs_0.9_outdir.tar.gz
        mmseqs_1.0_outdir.tar.gz
  panaroo_output/
        MGYG.._panaroo.tar.gz
        ...
  per-genome-annotations/ (for post-processing)
  drep_genomes/                   (for GTDB-Tk)
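
The layout above shows two kinds of per-cluster folders: some contain both `genome` and `pan-genome`, others only `genome`. A minimal sketch for telling them apart in a finished output directory, assuming the folder names shown above; the helper `classify_clusters` and the label "singleton" are illustrative and do not come from the pipeline:

```python
from pathlib import Path


def classify_clusters(output_dir):
    """Group per-cluster folders by whether they hold a pan-genome.

    A folder counts as a cluster folder if it has a `genome` subfolder;
    shared folders such as intermediate_files/ are therefore skipped.
    """
    result = {"pan_genome": [], "singleton": []}
    for entry in sorted(Path(output_dir).iterdir()):
        if not (entry.is_dir() and (entry / "genome").is_dir()):
            continue
        kind = "pan_genome" if (entry / "pan-genome").is_dir() else "singleton"
        result[kind].append(entry.name)
    return result
```

The returned dictionary lists folder names under each label, which gives a quick check that every cluster produced the files expected for its type.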

Tool description

  • CheckM: Estimate genome completeness and contamination.
  • GTDB-Tk: Genome taxonomic assignment using the GTDB framework.
  • dRep: Genome de-replication.
  • Mash2Nwk: Generate Mash distance tree of conspecific genomes.
  • Prokka: Predict protein-coding sequences from genome assembly.
  • MMseqs2: Cluster protein-coding sequences.
  • InterProScan: Protein functional annotation using the InterPro database.
  • eggNOG-mapper: Protein functional annotation using the eggNOG database.