All Projects → leylabmpi → DeepMAsED

leylabmpi / DeepMAsED

Licence: MIT license
Deep learning for Metagenome Assembly Error Detection

Programming Languages

Jupyter Notebook
11667 projects

Projects that are alternatives of or similar to DeepMAsED

metacal
Metagenomics calibration R package
Stars: ✭ 16 (-33.33%)
Mutual labels:  metagenomics
bonsai
Bonsai: Fast, flexible taxonomic analysis and classification
Stars: ✭ 66 (+175%)
Mutual labels:  metagenomics
MOSCA
Meta-Omics Software for Community Analysis
Stars: ✭ 26 (+8.33%)
Mutual labels:  metagenomics
Maaslin2
MaAsLin2: Microbiome Multivariate Association with Linear Models
Stars: ✭ 76 (+216.67%)
Mutual labels:  metagenomics
Jovian
Metagenomics/viromics pipeline that focuses on automation, user-friendliness and a clear audit trail. Jovian aims to empower classical biologists and wet-lab personnel to do metagenomics/viromics analyses themselves, without bioinformatics expertise.
Stars: ✭ 14 (-41.67%)
Mutual labels:  metagenomics
MetaCoAG
Binning Metagenomic Contigs via Composition, Coverage and Assembly Graphs
Stars: ✭ 29 (+20.83%)
Mutual labels:  metagenomics
SemiBin
No description or website provided.
Stars: ✭ 25 (+4.17%)
Mutual labels:  metagenomics
ganon
ganon classifies short DNA sequences against large sets of genomic sequences efficiently, with download and update of references (RefSeq/Genbank), taxonomic (NCBI/GTDB) and hierarchical classification, customized reporting and more
Stars: ✭ 57 (+137.5%)
Mutual labels:  metagenomics
DRAM
Distilled and Refined Annotation of Metabolism: A tool for the annotation and curation of function for microbial and viral genomes
Stars: ✭ 159 (+562.5%)
Mutual labels:  metagenomics
gargammel
gargammel is an ancient DNA simulator
Stars: ✭ 17 (-29.17%)
Mutual labels:  metagenomics
recentrifuge
Recentrifuge: robust comparative analysis and contamination removal for metagenomics
Stars: ✭ 79 (+229.17%)
Mutual labels:  metagenomics
traitar
From genomes to phenotypes: Traitar, the microbial trait analyzer
Stars: ✭ 41 (+70.83%)
Mutual labels:  metagenomics
metacherchant
No description or website provided.
Stars: ✭ 19 (-20.83%)
Mutual labels:  metagenomics
functree-ng
An interactive radial tree for functional hierarchies and omics data visualization
Stars: ✭ 18 (-25%)
Mutual labels:  metagenomics
PlasFlow
Software for prediction of plasmid sequences in metagenomic assemblies
Stars: ✭ 74 (+208.33%)
Mutual labels:  metagenome-assembly
micca
micca - MICrobial Community Analysis
Stars: ✭ 19 (-20.83%)
Mutual labels:  metagenomics
charcoal
Remove contaminated contigs from genomes using k-mers and taxonomies.
Stars: ✭ 32 (+33.33%)
Mutual labels:  metagenomics
sunbeam
A robust, extensible metagenomics pipeline
Stars: ✭ 143 (+495.83%)
Mutual labels:  metagenomics
kraken-biom
Create BIOM-format tables (http://biom-format.org) from Kraken output (http://ccb.jhu.edu/software/kraken/, https://github.com/DerrickWood/kraken).
Stars: ✭ 35 (+45.83%)
Mutual labels:  metagenomics
StrainFLAIR
Strain-level abundances estimation in metagenomic samples using variation graphs
Stars: ✭ 23 (-4.17%)
Mutual labels:  metagenomics

DeepMAsED

DeepMAsED

Deep learning for Metagenome Assembly Error Detection (DeepMAsED)

"mased"

Middle English term: misled, bewildered, amazed, or perplexed

Citation

Mineeva, Olga, Mateo Rojas-Carulla, Ruth E. Ley, Bernhard Schölkopf, and Nicholas D. Youngblut. 2020. "DeepMAsED: Evaluating the Quality of Metagenomic Assemblies." Bioinformatics , February.

Main Description

The tool is divided into two main parts:

  • DeepMAsED-SM
    • A snakemake pipeline for generating DeepMAsED train/test datasets from reference genomes
  • DeepMAsED-DL
    • A python package for misassembly detection via deep learning

Setup

Via the conda recipe

The simplest approach is to use the conda recipe:

conda create -n deepmased bioconda::deepmased

[alternative] The piecemeal setup

Dependency setup via conda

  • [If needed] Install miniconda (or anaconda)
  • See the conda create line in the .travis.yml file.
  • If just using DeepMAsED-SM:
    • conda create -n snakemake conda-forge::pandas bioconda::snakemake

Testing the DeepMAsED package (optional)

pytest -s

Installing the DeepMAsED package into the conda environment

  • Via setup.py
    • python setup.py install
  • Via pip
    • pip install DeepMAsED

Usage

Example of classifying contig misassemblies

You need to have the following input:

  • fasta of metagenome assembly contigs (uncompressed)
  • BAM file of metagenome reads mapped to the contigs

Create table mapping BAM & fasta files

If multiple sets of contigs (eg., MAGs) and BAM files, then which contigs go with which BAM files?

Create a tab-delim table of: bam<tab>fasta (header required)

This will be your bam_fasta_table, which is need for creating the features.

Create feature table(s)

DeepMAsED features $bam_fasta_table

This generates >=1 feature table and a table listing all output files (the "feature_file_table"). This feature_file_table will be the input for predict

Predict misassemblies

DeepMAsED predict $feature_file_table

...where feature_filt_table is the path to a table that lists all feature files (see above).

--force-ovewrite forces the re-creation of the pkl files, which is a bit slower but can prevent issues.

Change --save-path to set the output directory. Use --cpu-only to just use CPUs instead of a GPU.

Third, inspect the output

By default, the predictions will be written to deepmased_predictions.tsv.

Example output
Collection     Contig  Deepmased_score
0       NODE_1156_length_5232_cov_4.046938      0.0007264018
0       NODE_1563_length_3868_cov_5.851298      0.03783685
0       NODE_4288_length_1225_cov_3.235897      0.070887744
1       k141_9081       8.8751316e-05
1       k141_2594       6.720424e-05
1       k141_4878       0.0015754104
2       NODE_5204_length_1290_cov_3.283401      0.00036007166
2       NODE_2848_length_2164_cov_2.982456      0.0005029738
2       NODE_446_length_6027_cov_5.812291       0.068261534

See Mineeva et al., 2020 to help decide what score cutoff is prudent for classifying misassembled contigs.

Creating training datasets with DeepMAsED-SM

This is useful for training DeepMAsED-DL with a custom train/test dataset (e.g., just biome-specific taxa).

Input

  • A table listing refernce genomes. Two possible formats:
    • Genome-accession: <Taxon>\t<Accession>
      • "Taxon" = the species/strain name
      • "Accession" = the NCBI genbank genome accession
      • The genomes will be downloaded based on the accession
    • Genome-fasta: <Taxon>\t<Fasta>
      • "Taxon" = the species/strain name of the genome
      • "Fasta" = the fasta of the genome sequence
      • Use this option if you already have the genome fasta files (uncompressed or gzip'ed)
  • The snakemake config file (e.g., config.yaml). This includes:
    • Config params on MG communities
    • Config params on assemblers & parameters

The column order for the tables doesn't matter, but the column names must be exact.

Running locally

See the "Setup" section above for snakemake installation instructions.

cd ./DeepMAsED-SM/

Edit the config.yaml file as needed (eg., changing input & output paths)

snakemake --use-conda -j <NUMBER_OF_THREADS> --configfile <MY_CONFIG.yaml_FILE>

Running on SGE cluster

./snakemake_sge.sh <MY_CONFIG.yaml_FILE> cluster.json <PATH_FOR_SGE_LOGS> <NUMBER_OF_PARALLEL_JOBS> [additional snakemake options]

It should be rather easy to update the code to run on other cluster architectures. See the following resources for help:

Output

The output will the be same as for feature generation, but with extra directories:

  • ./output/genomes/
    • Reference genomes
  • ./output/MGSIM/
    • Simulated metagenomes
  • ./output/assembly/
    • Metagenome assemblies
  • ./output/true_errors/
    • Metagenome assembly errors determined by using the references
  • ./output/map/
    • Feature tables for each simulation

DeepMAsED-DL

Main interface: DeepMAsED -h

DeepMAsED [train|predict] can be run without GPUs, but the will be substantially slower.

Predicting with existing model

See DeepMAsED predict -h

Training a new model

See DeepMAsED train -h

Evaluating a model

See DeepMAsED evalulate -h

Creating features for predict

See DeepMAsED features -h

Features table

  • Basic info
    • assembler
      • metagenome assembler used
    • contig
      • contig ID
    • position
      • position on the contig (bp)
    • ref_base
      • nucleotide at that position on the contig
  • Extracted from the bam file
    • num_query_A
      • number of reads mapping to that position with 'A'
    • num_query_C
      • number of reads mapping to that position with 'C'
    • num_query_G
      • number of reads mapping to that position with 'G'
    • num_query_T
      • number of reads mapping to that position with 'T'
    • num_SNPs
      • number of SNPs at that position
    • coverage
      • number of reads mapping to that position
    • num_discordant
      • discordant reads according to the read mapper definition
    • num_supplementary
      • number of reads mapping to that position where the alignment is supplementary
      • see the samtools docs for more info
    • num_secondary
      • number of reads mapping to that position where the alignment is secondary
      • see the samtools docs for more info
  • MetaQUAST info
    • Extensive_misassembly
      • the "extensive misassembly" classification set by MetaQUAST
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].