charcoal

Remove contaminated bits of genomes using k-mer based taxonomic analysis with sourmash.

Still early in development. Buyer beware! Here be dragons!!

Installing!

In brief: clone this repository and change into the top-level repo directory. The file environment.yml contains the necessary conda packages (python and snakemake) to run charcoal; see the Quickstart section for explicit instructions.

Quickstart:

Clone the repository, change into it, create the conda environment, activate it, and install charcoal in editable mode:

git clone https://github.com/dib-lab/charcoal
cd ./charcoal/
conda env create -f environment.yml -n charcoal
conda activate charcoal
pip install -e .

Run the demo! (~2 minutes)

To run, execute (in the top-level directory):

python -m charcoal run demo/demo.conf -j 4

This will create two summary files in output.demo/, genome_summary.csv and hit_list_for_filtering.csv. You can open these in your favorite spreadsheet program.
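
If you would rather peek at the two summary files from Python than from a spreadsheet, here is a minimal sketch. It makes no assumption about the column names, which are not listed here, and the helper name `preview_csv` is hypothetical (it is not part of charcoal).

```python
import csv
from itertools import islice
from pathlib import Path


def preview_csv(path, max_rows=5):
    """Return (header, first rows) of a CSV file without assuming column names."""
    with open(path, newline="") as fp:
        reader = csv.reader(fp)
        header = next(reader, [])              # empty list if the file is empty
        rows = list(islice(reader, max_rows))  # at most max_rows data rows
    return header, rows


if __name__ == "__main__":
    # File names are the ones the demo run is described as producing.
    for name in ("genome_summary.csv", "hit_list_for_filtering.csv"):
        path = Path("output.demo") / name
        if not path.exists():
            print(f"{name}: not found (run the demo first)")
            continue
        header, rows = preview_csv(path)
        print(f"{name} columns: {header}")
        for row in rows:
            print("   ", row)
```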

For a friendlier summary, run:

python -m charcoal run demo/demo.conf -j 4 report

This will create a directory output.demo/report/ that contains an index page, index.html, that summarizes the charcoal run. This directory also contains individual genome reports that you can reach through links in the index.

Finally, you can run

python -m charcoal run demo/demo.conf -j 4 clean

and this will produce "cleaned" genomes based on the information in output.demo/hit_list_for_filtering.csv.
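
The three demo commands above differ only in their final target, so they can be chained from a small script. The sketch below is a convenience wrapper under that assumption; `charcoal_command` and `run_steps` are hypothetical helper names, not part of charcoal, and the script only prints the commands unless you call `run_steps` yourself.

```python
import subprocess
import sys


def charcoal_command(conf, jobs=4, target=None):
    """Build the argument list for one 'python -m charcoal run' invocation."""
    cmd = [sys.executable, "-m", "charcoal", "run", conf, "-j", str(jobs)]
    if target is not None:
        cmd.append(target)  # e.g. "report" or "clean"
    return cmd


def run_steps(conf, jobs=4, targets=(None, "report", "clean")):
    """Run each step in order; check=True stops at the first failing step."""
    for target in targets:
        subprocess.run(charcoal_command(conf, jobs, target), check=True)


if __name__ == "__main__":
    # Print the commands only; call run_steps("demo/demo.conf") to execute them.
    for target in (None, "report", "clean"):
        print(" ".join(charcoal_command("demo/demo.conf", 4, target)))
```

Using `sys.executable` keeps the run inside whichever Python is active, which should be the one from the charcoal conda environment.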

Do a full configure & run! (~10 minutes)

Now that the demo runs, you've got everything working! Hooray! Now let's see how to set up a real run, with real databases!

This will take under 10 minutes and under 2 GB of disk space. You'll need about 8 GB of RAM (change -j 4 to -j 1, below, to run it in 2 GB of RAM, albeit 4x slower).
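
The two figures above (about 8 GB for -j 4, about 2 GB for -j 1) suggest roughly 2 GB of RAM per parallel job. As a rough sketch, assuming memory scales linearly with the number of jobs (an extrapolation from those two data points, not a documented guarantee), you could pick a -j value like this; `jobs_for_ram` is a hypothetical helper:

```python
def jobs_for_ram(ram_gb, gb_per_job=2.0):
    """Largest -j value that fits in ram_gb, never less than 1.

    gb_per_job=2.0 is an assumption derived from the figures quoted above.
    """
    return max(1, int(ram_gb // gb_per_job))


if __name__ == "__main__":
    for ram in (2, 8):
        print(f"{ram} GB RAM: -j {jobs_for_ram(ram)}")
```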

We'll use a set of 10 genomes taken from "Nitrogen-fixing populations of Planctomycetes and Proteobacteria are abundant in surface ocean metagenomes" (Delmont et al., 2018). These 10 genomes comprise three eukaryotic and seven bacterial bins.

Install the database.

First, install the sourmash database for GTDB.

charcoal download-db

This will put two files in the db/ directory, totalling 1.5 GB. (You can run this command multiple times and it should only download the databases once.)

Download the example genomes

Next, download and unpack the example genomes:

curl -L https://osf.io/5pej8/download > example-genomes.tar.gz
tar xzf example-genomes.tar.gz
ls example-genomes/

The example-genomes/ directory should have 10 genomes in it. It also has a file provided-lineages.csv which labels the three eukaryotic genomes as d__Eukaryota. This is needed because eukaryotes cannot be automatically classified by charcoal, but they can be decontaminated.

Initiate a new project

Next, create a new project configuration:

charcoal init newproject --genome-dir example-genomes \
    --lineages example-genomes/provided-lineages.csv

This creates two files, newproject.genome-list.txt and newproject.conf. The genome-list file contains the names of all genome files, and the newproject.conf file contains the configuration options for charcoal.
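
Before launching a run, it can be worth checking that every name in the genome-list file points at a real file. The sketch below assumes the list holds one file name per line and that names are relative to the genome directory; neither detail is stated here, so check your own newproject.genome-list.txt first. The helper names are hypothetical.

```python
from pathlib import Path


def read_genome_list(list_path):
    """Return the non-empty lines of a genome-list file, stripped."""
    with open(list_path) as fp:
        return [line.strip() for line in fp if line.strip()]


def missing_genomes(list_path, genome_dir):
    """Return listed names that do not exist as files in genome_dir."""
    genome_dir = Path(genome_dir)
    return [name for name in read_genome_list(list_path)
            if not (genome_dir / name).is_file()]


if __name__ == "__main__":
    list_path = Path("newproject.genome-list.txt")
    if list_path.exists():
        missing = missing_genomes(list_path, "example-genomes")
        print("missing genome files:", missing or "none")
    else:
        print("no genome list found; run 'charcoal init' first")
```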

To do a "dry run" of charcoal, which lists out the jobs that will be run, execute:

python -m charcoal run newproject.conf -n

Decontaminate!

And, finally, run the first round of analysis! This will run four processes in parallel (-j 4):

python -m charcoal run newproject.conf -j 4

Examine the results

Note: this section is out of date.

The results will be in output.newproject/; see the file combined_summary.csv, as well as the *.clean.fa.gz files, which contain the cleaned contigs. You might also take a look at the *.report.txt files which contain individual genome cleaning reports.

For one example, the summary spreadsheet shows that approximately 10% of TARA_PSW_MAG_00136.fa was removed (column f_removed), and the report in output.newproject/TARA_PSW_MAG_00136.fa.report.txt shows that contigs were removed for being members of a variety of different bacterial lineages.
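
To find the genomes that lost the most sequence without scrolling through the spreadsheet, you could filter the summary on f_removed. This is a sketch only: it assumes f_removed is stored as a fraction between 0 and 1 and that the summary is a plain CSV with a header row. Since this section is flagged as out of date, the file and column names in your version may differ. `heavily_cleaned` is a hypothetical helper.

```python
import csv


def heavily_cleaned(summary_csv, min_fraction=0.05, column="f_removed"):
    """Return rows whose removed fraction is at least min_fraction."""
    selected = []
    with open(summary_csv, newline="") as fp:
        for row in csv.DictReader(fp):
            try:
                fraction = float(row.get(column, ""))
            except (TypeError, ValueError):
                continue  # skip rows with a missing or non-numeric value
            if fraction >= min_fraction:
                selected.append(row)
    return selected


if __name__ == "__main__":
    import os

    path = os.path.join("output.newproject", "combined_summary.csv")
    if os.path.exists(path):
        for row in heavily_cleaned(path):
            print(row)
    else:
        print(path, "not found")
```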

Need help?

There's more documentation under the doc/ directory.

Please ask questions and file issues on the GitHub issue tracker!

@ctb @taylorreiter May 2020
