ganon

ganon classifies short DNA sequences against large sets of genomic reference sequences efficiently. It automatically downloads, builds and updates commonly used datasets (refseq/genbank), performs taxonomic (ncbi or gtdb) and hierarchical classification, generates custom reports and tables among many other features.

Quick install/usage guide
Details
Examples
Output files
Building customized databases
Multiple and Hierarchical classification
Choosing and explaining parameters
Parameters

Quick install/usage guide

Install with conda

conda install -c bioconda -c conda-forge ganon

Download and build

# Archaeal complete genome sequences from NCBI RefSeq
ganon build --db-prefix arc_cg_rs --source refseq --organism-group archaea --complete-genomes --threads 12

Classify

ganon classify --db-prefix arc_cg_rs --output-prefix classify_results --single-reads my_reads.fq.gz --threads 12

Re-generate reports and create tables from multiple reports

ganon report --db-prefix arc_cg_rs --input classify_results.rep --output-prefix filtered_report --min-count 0.01
ganon table --input classify_results.tre filtered_report.tre --output-file output_table.tsv --top-sample 10

Update the database at a later time point

ganon update --db-prefix arc_cg_rs --threads 12

More examples

Details

ganon is designed to index large sets of genomic reference sequences and to classify short reads against them efficiently. The tool uses Interleaved Bloom Filters as indices based on k-mers/minimizers. It was developed mainly for, but is not limited to, the metagenomics classification problem: quickly assigning short fragments to their closest reference among thousands of references.
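To make the idea of an Interleaved Bloom Filter more concrete, here is a minimal toy sketch in Python. It is not ganon's implementation: the class name, the SHA-256 based hashing and the tiny sizes are assumptions chosen for illustration. It only shows the principle that one lookup per hash function returns the membership bits of all bins (references) at once.

```python
import hashlib


class ToyInterleavedBloomFilter:
    """Toy IBF: `bins` Bloom filters of `bits_per_bin` bits each, stored interleaved."""

    def __init__(self, bins, bits_per_bin, hashes=3):
        self.bins = bins
        self.bits_per_bin = bits_per_bin
        self.hashes = hashes
        # Row r holds bit position r of every bin side by side.
        self.rows = [[False] * bins for _ in range(bits_per_bin)]

    def _positions(self, kmer):
        # One row index per hash function (hash choice is illustrative only).
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits_per_bin

    def insert(self, kmer, bin_index):
        for pos in self._positions(kmer):
            self.rows[pos][bin_index] = True

    def query(self, kmer):
        """Return one boolean per bin: True if the k-mer may be present in that bin."""
        hits = [True] * self.bins
        for pos in self._positions(kmer):
            hits = [h and b for h, b in zip(hits, self.rows[pos])]
        return hits


def kmers(seq, k):
    """All overlapping k-mers of a sequence (forward strand only)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


ibf = ToyInterleavedBloomFilter(bins=2, bits_per_bin=1024)
for kmer in kmers("ACGTACGTTGCA", 5):   # reference stored in bin 0
    ibf.insert(kmer, 0)
for kmer in kmers("TTTTGGGGCCCC", 5):   # reference stored in bin 1
    ibf.insert(kmer, 1)

# Count, per bin, how many k-mers of a read are (possibly) present.
counts = [0, 0]
for kmer in kmers("GTACGTTG", 5):
    for bin_index, hit in enumerate(ibf.query(kmer)):
        counts[bin_index] += hit
print(counts)
```

Bin 0 always receives a count for every k-mer of the read, because Bloom filters have no false negatives. Bin 1 can occasionally receive a count through a false positive, which is what the --max-fp parameter described later is meant to bound.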

Features

NCBI and GTDB native support for taxonomic classification
integrated download of commonly used reference sequences from RefSeq/Genbank (ganon build)
update indices incrementally (ganon update)
customizable build for pre-downloaded or non-standard sequence files (ganon build-custom)
build and classify at different taxonomic levels, file, sequence, strain/assembly or custom specialization
perform hierarchical classification: use several databases in any order
report the lowest common ancestor (LCA) but also multiple and unique matches for every read
generate reports and tables for multi-sample studies with filters and further customizations

ganon achieved very good results in our own evaluations but also in independent evaluations: LEMMI, LEMMI v2 and CAMI2

Installation guide

The easiest way to install ganon is via conda, using the bioconda and conda-forge channels:

conda install -c bioconda -c conda-forge ganon

However, compiling ganon from source on the target machine may give better performance than the conda version. To do so, please follow the instructions below:

Instructions

build dependencies

System packages:

gcc >=7
cmake >=3.10
zlib

run dependencies

System packages:

python >=3.6
pandas >=1.1.0
multitax >=1.1.1

python3 -V # >=3.6
python3 -m pip install "pandas>=1.1.0" "multitax>=1.1.1"

Downloading and building ganon + submodules

git clone --recurse-submodules https://github.com/pirovc/ganon.git

cd ganon
python3 setup.py install --record files.txt #optional
mkdir build_cpp
cd build_cpp
cmake -DCMAKE_BUILD_TYPE=Release -DVERBOSE_CONFIG=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCONDA=OFF ..
make
sudo make install #optional

To change the install location (e.g. /myprefix/bin/), set the installation prefix in the cmake command with -DCMAKE_INSTALL_PREFIX=/myprefix/
Use -DINCLUDE_DIRS to set alternative paths to the cxxopts and Catch2 libraries.

If everything was properly installed, the following command should show the help page without errors:

ganon -h

Run tests

python3 -m unittest discover -s tests/ganon/integration/
python3 -m unittest discover -s tests/ganon/integration_online/ #optional - downloads large files
cd build_cpp/
ctest -VV .

Examples

Commonly used reference set for metagenomics analysis (complete genomes, NCBI RefSeq, archaea+bacteria+fungi+viral)

ganon build --db-prefix abfv_rs_cg --organism-group archaea bacteria fungi viral --source refseq --taxonomy ncbi --complete-genomes --threads 12

Top 3 bacterial genomes (for each taxon) from NCBI RefSeq

ganon build --db-prefix bac_rs_top3 --organism-group bacteria --source refseq --taxonomy ncbi --top 3 --threads 12

Complete GTDB database

ganon build --db-prefix complete_gtdb --organism-group archaea bacteria --source refseq genbank --taxonomy gtdb --threads 12

Database based on specific taxonomic identifiers (203492 - Fusobacteriaceae)

# NCBI
ganon build --db-prefix fuso_ncbi --taxid "203492" --source refseq genbank --taxonomy ncbi --threads 12
# GTDB
ganon build --db-prefix fuso_gtdb --taxid "f__Fusobacteriaceae" --source refseq genbank --taxonomy gtdb --threads 12

Customized database at assembly level

ganon build-custom --db-prefix my_db --input my_big_fasta_file.fasta.gz --level assembly --threads 12

Customized database at species level, built from files previously downloaded with genome_updater

ganon build-custom --db-prefix custom_db_gu --input myfiles/2022-06-28_10-02-14/files/ --level species --ncbi-file-info outfolder/2022-06-28_10-02-14/assembly_summary.txt --threads 12

Customized database with sequence as target (to classify reads at sequence level)

ganon build-custom --db-prefix seq_target --input myfiles/2022-06-28_10-02-14/files/ --ncbi-file-info outfolder/2022-06-28_10-02-14/assembly_summary.txt --input-target sequence --threads 12

Output files

build/update

Every run on ganon build, ganon build-custom or ganon update will generate the following database files:

{prefix}.ibf: main interleaved bloom filter index file
{prefix}.tax: taxonomic tree (fields: target/node, parent, rank, name) (only if --taxonomy is used)
{prefix}_files/: folder containing downloaded reference sequences and auxiliary files. Not necessary for classification. Keep this folder if the database will be updated later. Otherwise it can be deleted.

Note: Database files generated with version 1.2.0 or higher are not compatible with older versions.

classify

{prefix}.rep: plain report of the run with only targets that received a match (fields: 1) hierarchy_label, 2) target, 3) total matches, 4) unique reads, 5) lca reads, 6) rank, 7) name). At the end prints 2 extra lines with #total_classified and #total_unclassified
{prefix}.lca: output with one match for each classified read after LCA. Only generated with --output-lca active. If multiple hierarchy levels are set, one file for each level will be created: {prefix}.{hierarchy}.lca (fields: 1) read identifier, 2) target, 3) (max) k-mer/minimizer count)
{prefix}.all: output with all matches for each read. Only generated with --output-all active. Warning: file can be very large. If multiple hierarchy levels are set, one file for each level will be created: {prefix}.{hierarchy}.all (fields: 1) read identifier, 2) target, 3) k-mer/minimizer count)
{prefix}.tre: report file (see below)
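As an illustration of the {prefix}.rep layout described above, the following sketch reads such a file into Python dictionaries. It assumes the fields are tab-separated and that the two trailing summary lines have the form `#total_classified<TAB>N`. The field names and the sample content are made up for this example and should be checked against a real file.

```python
import csv
import io

# Field names chosen for this sketch, following the order listed above.
REP_FIELDS = ["hierarchy_label", "target", "total_matches",
              "unique_reads", "lca_reads", "rank", "name"]


def parse_rep(handle):
    """Return (rows, totals) from an open .rep file (assumed tab-separated)."""
    rows, totals = [], {}
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not fields:
            continue  # skip empty lines
        if fields[0].startswith("#"):
            # Summary lines at the end of the file, e.g. "#total_classified\t69"
            totals[fields[0].lstrip("#")] = int(fields[1])
            continue
        row = dict(zip(REP_FIELDS, fields))
        for key in ("total_matches", "unique_reads", "lca_reads"):
            row[key] = int(row[key])
        rows.append(row)
    return rows, totals


# Made-up content, for illustration only.
example = (
    "H1\t1406\t57\t57\t0\tspecies\tPaenibacillus polymyxa\n"
    "H1\t470\t12\t12\t0\tspecies\tAcinetobacter baumannii\n"
    "#total_classified\t69\n"
    "#total_unclassified\t2\n"
)
rows, totals = parse_rep(io.StringIO(example))
print(rows[0]["name"], totals)
```

For a real run, the `io.StringIO` object would be replaced by `open("classify_results.rep")`.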

report

{prefix}.tre: tab-separated tree-like report with cumulative counts and taxonomic lineage. By default, this is a read-based report where each read classified is counted once. It is possible to generate this for all read matches (ganon report --report-type matches). In this case, single and shared matches are reported to their target. Each line in this report is a taxonomic entry, with the following fields:

taxonomic rank (e.g. phylum, species, ...)
target (e.g. taxid/specialization)
target lineage (e.g. 1|2|1224|...)
target name (e.g. Paenibacillus polymyxa)
# unique assignments (number of reads that matched exclusively to this target)
# shared assignments (number of reads with non-unique matches directly assigned to this target. Represents the lca matches (--report-type reads) or shared matches (--report-type matches))
# children assignments (number of reads assigned to all children nodes of this target)
# cumulative assignments (the sum of the unique, shared and children reads/matches assigned up-to this target)
% cumulative assignments

With --report-type reads, the first line of the file shows the number of unclassified reads.
The cumulative assignments of the unclassified and root lines should sum to 100%. The final cumulative sum of reads/matches may be under 100% if any filter is applied and/or a hierarchical selection is used (keep/skip/split).
With --report-type reads, only taxa that received direct read matches, either unique or through LCA, are considered. Some reads may have only shared matches and will not be reported directly (but will be accounted for at some parent level). To access those matches, create a report with --report-type matches or use the file {prefix}.rep directly.
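The nine fields of the {prefix}.tre report can be loaded with a few lines of Python. The sketch below is only an illustration: the column names are labels invented here, the sample rows are copied from the example report in this document and re-joined with tabs, and the unclassified line is left out because its exact layout is not spelled out above.

```python
import csv
import io

# Column labels chosen for this sketch, following the field order listed above.
TRE_FIELDS = ["rank", "target", "lineage", "name", "unique",
              "shared", "children", "cumulative", "cumulative_perc"]


def read_tre(handle):
    """Read a tab-separated .tre report into a list of dictionaries."""
    entries = []
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not fields:
            continue  # skip empty lines
        entry = dict(zip(TRE_FIELDS, fields))
        for key in ("unique", "shared", "children", "cumulative"):
            entry[key] = int(entry[key])
        entry["cumulative_perc"] = float(entry["cumulative_perc"])
        entries.append(entry)
    return entries


example = "\n".join([
    "root\t1\t1\troot\t0\t0\t97\t97\t97.97980",
    "genus\t469\t1|2|1224|1236|72274|468|469\tAcinetobacter\t0\t0\t12\t12\t12.12121",
    "species\t470\t1|2|1224|1236|72274|468|469|470\tAcinetobacter baumannii\t12\t0\t0\t12\t12.12121",
]) + "\n"

entries = read_tre(io.StringIO(example))
for entry in entries:
    # Cumulative assignments are the sum of unique, shared and children.
    total = entry["unique"] + entry["shared"] + entry["children"]
    print(entry["name"], entry["cumulative"], total)
```

The loop at the end checks the relationship stated in the field list: for each taxonomic entry, the cumulative count equals unique + shared + children.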

table

{output_file}: a tab-separated file with counts/percentages of taxa for multiple samples

Examples of output files

The main output file is the {prefix}.tre which will summarize the results:

unclassified                                                 unclassified             0   0  0   2   2.02020
root          1       1                                      root                     0   0  97  97  97.97980
superkingdom  2       1|2                                    Bacteria                 0   0  97  97  97.97980
phylum        1239    1|2|1239                               Firmicutes               0   0  57  57  57.57576
phylum        1224    1|2|1224                               Proteobacteria           0   0  40  40  40.40404
class         91061   1|2|1239|91061                         Bacilli                  0   0  57  57  57.57576
class         28211   1|2|1224|28211                         Alphaproteobacteria      0   0  28  28  28.28283
class         1236    1|2|1224|1236                          Gammaproteobacteria      0   0  12  12  12.12121
order         1385    1|2|1239|91061|1385                    Bacillales               0   0  57  57  57.57576
order         204458  1|2|1224|28211|204458                  Caulobacterales          0   0  28  28  28.28283
order         72274   1|2|1224|1236|72274                    Pseudomonadales          0   0  12  12  12.12121
family        186822  1|2|1239|91061|1385|186822             Paenibacillaceae         0   0  57  57  57.57576
family        76892   1|2|1224|28211|204458|76892            Caulobacteraceae         0   0  28  28  28.28283
family        468     1|2|1224|1236|72274|468                Moraxellaceae            0   0  12  12  12.12121
genus         44249   1|2|1239|91061|1385|186822|44249       Paenibacillus            0   0  57  57  57.57576
genus         75      1|2|1224|28211|204458|76892|75         Caulobacter              0   0  28  28  28.28283
genus         469     1|2|1224|1236|72274|468|469            Acinetobacter            0   0  12  12  12.12121
species       1406    1|2|1239|91061|1385|186822|44249|1406  Paenibacillus polymyxa   57  0  0   57  57.57576
species       366602  1|2|1224|28211|204458|76892|75|366602  Caulobacter sp. K31      28  0  0   28  28.28283
species       470     1|2|1224|1236|72274|468|469|470        Acinetobacter baumannii  12  0  0   12  12.12121

When running ganon classify or ganon report with --ranks all, the output shows all ranks used for classification, sorted by lineage (sorting by lineage is also available with ganon report --sort lineage):

unclassified                                                                  unclassified                                   0   0  0   2   2.02020
root           1        1                                                     root                                           0   0  97  97  97.97980
no rank        131567   1|131567                                              cellular organisms                             0   0  97  97  97.97980
superkingdom   2        1|131567|2                                            Bacteria                                       0   0  97  97  97.97980
phylum         1224     1|131567|2|1224                                       Proteobacteria                                 0   0  40  40  40.40404
class          1236     1|131567|2|1224|1236                                  Gammaproteobacteria                            0   0  12  12  12.12121
order          72274    1|131567|2|1224|1236|72274                            Pseudomonadales                                0   0  12  12  12.12121
family         468      1|131567|2|1224|1236|72274|468                        Moraxellaceae                                  0   0  12  12  12.12121
genus          469      1|131567|2|1224|1236|72274|468|469                    Acinetobacter                                  0   0  12  12  12.12121
species group  909768   1|131567|2|1224|1236|72274|468|469|909768             Acinetobacter calcoaceticus/baumannii complex  0   0  12  12  12.12121
species        470      1|131567|2|1224|1236|72274|468|469|909768|470         Acinetobacter baumannii                        12  0  0   12  12.12121
class          28211    1|131567|2|1224|28211                                 Alphaproteobacteria                            0   0  28  28  28.28283
order          204458   1|131567|2|1224|28211|204458                          Caulobacterales                                0   0  28  28  28.28283
family         76892    1|131567|2|1224|28211|204458|76892                    Caulobacteraceae                               0   0  28  28  28.28283
genus          75       1|131567|2|1224|28211|204458|76892|75                 Caulobacter                                    0   0  28  28  28.28283
species        366602   1|131567|2|1224|28211|204458|76892|75|366602          Caulobacter sp. K31                            28  0  0   28  28.28283
no rank        1783272  1|131567|2|1783272                                    Terrabacteria group                            0   0  57  57  57.57576
phylum         1239     1|131567|2|1783272|1239                               Firmicutes                                     0   0  57  57  57.57576
class          91061    1|131567|2|1783272|1239|91061                         Bacilli                                        0   0  57  57  57.57576
order          1385     1|131567|2|1783272|1239|91061|1385                    Bacillales                                     0   0  57  57  57.57576
family         186822   1|131567|2|1783272|1239|91061|1385|186822             Paenibacillaceae                               0   0  57  57  57.57576
genus          44249    1|131567|2|1783272|1239|91061|1385|186822|44249       Paenibacillus                                  0   0  57  57  57.57576
species        1406     1|131567|2|1783272|1239|91061|1385|186822|44249|1406  Paenibacillus polymyxa                         57  0  0   57  57.57576
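The lineage column in the report above also gives a simple way to see what a lowest common ancestor (LCA) is. The helper below is a hypothetical illustration, not ganon's LCA code: it takes `|`-separated lineages and returns the deepest node they share.

```python
def lca_from_lineages(lineages):
    """Return the deepest node shared by all '|'-separated lineages, or None."""
    paths = [lineage.split("|") for lineage in lineages]
    common = None
    # Walk all lineages in parallel from the root until they diverge.
    for nodes in zip(*paths):
        if len(set(nodes)) != 1:
            break
        common = nodes[0]
    return common


# Lineages copied from the --ranks all report above.
baumannii = "1|131567|2|1224|1236|72274|468|469|909768|470"
caulobacter = "1|131567|2|1224|28211|204458|76892|75|366602"
polymyxa = "1|131567|2|1783272|1239|91061|1385|186822|44249|1406"

print(lca_from_lineages([baumannii, caulobacter]))            # 1224 (Proteobacteria)
print(lca_from_lineages([baumannii, caulobacter, polymyxa]))  # 2 (Bacteria)
```

A read matching both Acinetobacter baumannii and Caulobacter sp. K31 would therefore be placed at Proteobacteria in an LCA-based report.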

Building customized databases

Besides the automated download and build (ganon build) ganon provides a highly customizable build procedure (ganon build-custom) to create databases.

To use custom sequences, just provide them with --input. ganon will try to retrieve all information necessary to build a database.

ganon expects assembly accessions if building by file (e.g. filenames should be similar to GCA_002211645.1_ASM221164v1_genomic.fna.gz) or accession versions if building by sequence (e.g. headers should look like >CP022124.1 Fusobacterium nu...). More information about building by file or sequence can be found in the section "Target file or sequence (--input-target)".
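To check in advance whether file names and headers follow these conventions, a small helper like the one below can be used. It is a sketch under assumptions: the regular expression encodes the usual NCBI assembly accession shape (GCA_ or GCF_, nine digits, a version), and the header rule simply takes the first whitespace-delimited token. ganon's own parsing may differ.

```python
import re

# Usual shape of NCBI assembly accessions: GCA_/GCF_ + 9 digits + ".version"
ASSEMBLY_RE = re.compile(r"GC[AF]_\d{9}\.\d+")


def assembly_accession(filename):
    """Return the assembly accession found in a file name, or None."""
    match = ASSEMBLY_RE.search(filename)
    return match.group(0) if match else None


def accession_version(header):
    """Return the first token of a FASTA header (without '>'), or None."""
    tokens = header.lstrip(">").split()
    return tokens[0] if tokens else None


print(assembly_accession("GCA_002211645.1_ASM221164v1_genomic.fna.gz"))  # GCA_002211645.1
print(accession_version(">CP022124.1 Fusobacterium nu..."))              # CP022124.1
print(assembly_accession("my_big_fasta_file.fasta.gz"))                  # None
```

Files or headers for which these helpers return None are candidates for the --input-file mechanism described next.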

It is also possible to use non-standard accessions and headers to build databases with --input-file. This file should contain the following fields (tab-separated): file, [target, node, specialization, specialization_name].

Examples of --input-file

Using --input-target sequence:

sequences.fasta	HEADER1
sequences.fasta	HEADER2
sequences.fasta	HEADER3
others.fasta	HEADER4
others.fasta	HEADER5

or using --input-target file:

sequences.fasta	FILE_A
others.fasta	FILE_B

Nodes can be provided to link the data with taxonomy. For example (using --taxonomy ncbi):

sequences.fasta	HEADER1	562
sequences.fasta	HEADER2	562
sequences.fasta	HEADER3	562
others.fasta	HEADER4	623
others.fasta	HEADER5	623

Specialization can be used to create an additional classification level after the taxonomic leaves. For example (using --level custom):

sequences.fasta	HEADER1	562	ID443	Escherichia coli TW10119
sequences.fasta	HEADER2	562	ID297	Escherichia coli PCN079
sequences.fasta	HEADER3	562	ID8873	Escherichia coli P0301867.7
others.fasta	HEADER4	623	ID2241	Shigella flexneri 1a
others.fasta	HEADER5	623	ID4422	Shigella flexneri 1b
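Because the last field may contain spaces, it is safest to generate the --input-file programmatically with explicit tab separators. The following sketch shows one way to do this. The function name is hypothetical and the records are taken from the example above.

```python
def format_input_file(records):
    """Join each record's fields with tabs, one record per line."""
    lines = []
    for record in records:
        # file is mandatory; target, node, specialization and name are optional
        if not 1 <= len(record) <= 5:
            raise ValueError(f"expected 1 to 5 fields, got {len(record)}")
        if any("\t" in field or "\n" in field for field in record):
            raise ValueError("fields must not contain tabs or newlines")
        lines.append("\t".join(record))
    return "\n".join(lines) + "\n"


records = [
    ("sequences.fasta", "HEADER1", "562", "ID443", "Escherichia coli TW10119"),
    ("others.fasta", "HEADER4", "623", "ID2241", "Shigella flexneri 1a"),
]
content = format_input_file(records)
print(content, end="")

# To save it for use with --input-file (file name is an example):
# with open("input_file.tsv", "w") as handle:
#     handle.write(content)
```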

Multiple and Hierarchical classification

ganon classify can be run against multiple databases at the same time. The databases can also be provided in a hierarchical order.

Multiple database classification can be performed providing several inputs for --db-prefix. They are required to be built with the same --kmer-size and --window-size values. Multiple databases are considered as one (as if built together) and redundancy in content (same reference in two or more databases) is allowed.

To classify reads in a hierarchical order, --hierarchy-labels should be provided. When using multiple hierarchical levels, output files will be generated for each level (use --output-single to generate a single output from multiple hierarchical levels). Please note that some parameters are set for each database (e.g. --rel-cutoff) while others are set for each hierarchical level (e.g. --rel-filter).

Examples

Classifying reads against multiple databases:

ganon classify --db-prefix db1 db2 db3 \
               --rel-cutoff 0.75 \
               --single-reads reads.fq.gz

Classification against 3 databases (as if they were one) using the same cutoff.

Classifying reads against multiple databases with different cutoffs:

ganon classify --db-prefix  db1 db2 db3 \
               --rel-cutoff 0.2 0.3 0.1 \
               --single-reads reads.fq.gz

Classification against 3 databases (as if they were one) using a different cutoff for each.

Classifying reads against multiple databases hierarchically:

ganon classify --db-prefix            db1     db2      db3 \
               --hierarchy-labels 1_first 1_first 2_second \
               --single-reads reads.fq.gz

In this example, reads are going to be classified first against db1 and db2. Reads without a valid match will be further classified against db3. --hierarchy-labels are strings and are going to be sorted to define the hierarchy order, disregarding input order.
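The way labels define the hierarchy can be mimicked in a few lines of Python. This is only a sketch of the behaviour described above (labels sorted as plain strings, databases with the same label grouped into one level). The function name is hypothetical and ganon's exact sorting rules were not checked.

```python
from itertools import groupby


def hierarchy_order(db_prefixes, labels):
    """Group database prefixes by hierarchy label, levels sorted by label."""
    # sorted() is stable, so databases keep their input order within a level.
    pairs = sorted(zip(labels, db_prefixes), key=lambda pair: pair[0])
    return [(label, [db for _, db in group])
            for label, group in groupby(pairs, key=lambda pair: pair[0])]


levels = hierarchy_order(["db1", "db2", "db3"],
                         ["1_first", "1_first", "2_second"])
print(levels)

# Input order does not matter, only the labels do.
shuffled = hierarchy_order(["db3", "db1", "db2"],
                           ["2_second", "1_first", "1_first"])
print(shuffled == levels)
```

One consequence of plain string sorting is that a label such as `10_x` sorts before `2_x`, so zero-padded prefixes (`01_`, `02_`, ...) are a safer choice when there are many levels.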

Classifying reads against multiple databases hierarchically with different cutoffs:

ganon classify --db-prefix            db1     db2      db3 \
               --hierarchy-labels 1_first 1_first 2_second \
               --rel-cutoff             1     0.5     0.25 \
               --rel-filter           0.1              0.5 \
               --single-reads reads.fq.gz

In this example, classification will be performed with a different --rel-cutoff for each database. For each hierarchy level (1_first and 2_second), a different --rel-filter will be used.

Choosing and explaining parameters

ganon build

filter false positive and size (--max-fp, --filter-size)

ganon indices are based on Bloom filters and can produce false positive matches. This can be controlled with the --max-fp parameter. The lower the --max-fp, the lower the chance of false positives, but the larger the database will be. For example, with --max-fp 0.01 the database will be built so that any target (e.g. assembly, specified with --level) has a 1 in 100 chance of reporting a false match (between minimizers/k-mers).

Alternatively, one can set a specific size for the final index with --filter-size. When using this option, please check the theoretical false positive rate of the index reported at the end of the build process.
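The trade-off between false positive rate and size can be explored with the textbook Bloom filter approximation. The sketch below applies it to a single filter holding n elements. ganon's actual sizing of an interleaved filter with many targets may differ, so the numbers are only indicative, and the element count is made up.

```python
import math


def bloom_fp(n_elements, n_bits, n_hashes):
    """Approximate false positive rate of a Bloom filter."""
    return (1.0 - math.exp(-n_hashes * n_elements / n_bits)) ** n_hashes


def bloom_bits(n_elements, max_fp, n_hashes):
    """Bits needed so that the approximate false positive rate is <= max_fp."""
    return math.ceil(-n_hashes * n_elements
                     / math.log(1.0 - max_fp ** (1.0 / n_hashes)))


n = 1_000_000  # made-up number of minimizers in one target
for max_fp in (0.05, 0.01, 0.001):
    bits = bloom_bits(n, max_fp, n_hashes=3)
    print(f"max_fp={max_fp}: {bits / 8 / 1e6:.2f} MB, "
          f"fp={bloom_fp(n, bits, 3):.4f}")
```

Lowering the target false positive rate increases the required size, which matches the behaviour described for --max-fp. Conversely, fixing the size (as with --filter-size) fixes the resulting false positive rate for a given amount of content.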

minimizers (--window-size, --kmer-size)

In ganon build, minimizers are used when --window-size > --kmer-size. This means that for every window, a single k-mer is selected. This produces smaller database files and requires substantially less memory overall. It may increase build times but greatly reduces classification times. Sensitivity and precision can be reduced by small margins. If --window-size = --kmer-size, all k-mers are used to build the database.
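The selection of one k-mer per window can be sketched as follows. This is a simplified illustration, not ganon's scheme: it picks the lexicographically smallest k-mer on the forward strand only, whereas real implementations typically hash k-mers and take both strands into account. The sequence is made up.

```python
def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizers(seq, k, w):
    """Set of minimizers: the smallest k-mer in every window of w bases."""
    if w < k:
        raise ValueError("window size must be >= k-mer size")
    selected = set()
    for start in range(len(seq) - w + 1):
        window = seq[start:start + w]
        selected.add(min(kmers(window, k)))
    return selected


seq = "ACGTTGCATGCCGATAGCTAGGCTA"
all_kmers = set(kmers(seq, 4))
print(len(all_kmers), len(minimizers(seq, 4, 8)))

# With window size equal to k-mer size, every k-mer is selected.
print(minimizers(seq, 4, 4) == all_kmers)
```

Since neighbouring windows often share the same smallest k-mer, the set of minimizers is usually much smaller than the set of all k-mers, which is where the savings in database size and memory come from.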

ganon classify

reads (--single-reads, --paired-reads)

ganon accepts single-end and paired-end reads. Both types can be used at the same time. In paired-end mode, reads are always reported with the header of the first pair. Paired-end reads are classified in a standard forward-reverse orientation. The max. read length accepted is 65535 (accounting for both reads in paired mode).

cutoff and filter (--rel-cutoff, --rel-filter)

ganon has two parameters to control a match between reads and references: --rel-cutoff and --rel-filter.

Every read can be classified against none, one or more references. What will be reported is the remaining matches after cutoff and filter thresholds are applied, based on the number of shared minimizers (or k-mers) between sequences.

The cutoff is applied first. It should be set as the minimal value to consider a match between a read and a reference. Next, the filter is applied to the remaining matches. Filter thresholds are relative to the best-scoring match and control how far from the best match further matches are allowed. The cutoff can be interpreted as the lower bound to discard spurious matches and the filter as the fine-tuning to control what to keep.

Example

Using --kmer-size 19 (and --window-size 19 to simplify the example), a certain read (100bp) has the following matches with the 5 references (ref1..5), sorted by shared k-mers:

reference	shared k-mers
ref1	82
ref2	68
ref3	44
ref4	25
ref5	20

This read can have at most 82 shared k-mers (100-19+1=82). With --rel-cutoff 0.25, the following matches will be discarded:

reference	shared k-mers	--rel-cutoff 0.25
ref1	82
ref2	68
ref3	44
ref4	25
ref5	20	X

since the --rel-cutoff threshold is 82 * 0.25 = 20.5, rounded up to 21 (ceiling is applied). Further, with --rel-filter 0.3, the following matches will be discarded:

reference	shared k-mers	--rel-cutoff 0.25	--rel-filter 0.3
ref1	82
ref2	68
ref3	44		X
ref4	25		X
ref5	20	X

Since the best match is 82, the filter removes any match that is more than 0.3 * 82 = 24.6, rounded up to 25 (ceiling is applied), shared k-mers away from the best match, i.e. any match below 82 - 25 = 57 shared k-mers. ref1 and ref2 are reported as matches.
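The worked example can be reproduced with a short Python sketch. The function is hypothetical and rests on one assumption inferred from the numbers above rather than from ganon's source: the filter threshold is taken as the best count minus the ceiling of `rel_filter` times the best count, which is the reading that yields the threshold of 57.

```python
import math


def apply_cutoff_and_filter(matches, max_kmers, rel_cutoff, rel_filter):
    """Return the matches that survive the cutoff and then the filter."""
    # Cutoff: minimum number of shared k-mers to consider a match at all.
    cutoff_threshold = math.ceil(max_kmers * rel_cutoff)
    kept = {ref: n for ref, n in matches.items() if n >= cutoff_threshold}
    if not kept:
        return {}
    # Filter: how far below the best match a further match may be (assumed form).
    best = max(kept.values())
    filter_threshold = best - math.ceil(best * rel_filter)
    return {ref: n for ref, n in kept.items() if n >= filter_threshold}


matches = {"ref1": 82, "ref2": 68, "ref3": 44, "ref4": 25, "ref5": 20}
max_kmers = 100 - 19 + 1  # 82 possible k-mers in a 100bp read with k=19

after_cutoff = apply_cutoff_and_filter(matches, max_kmers, 0.25, 1.0)
reported = apply_cutoff_and_filter(matches, max_kmers, 0.25, 0.3)
print(sorted(after_cutoff))  # ref1..ref4 (ref5 falls below 21)
print(sorted(reported))      # ref1 and ref2 (threshold 57)
```

Setting `rel_filter` to 1.0 in the first call makes the filter threshold zero, so only the cutoff has an effect. This makes the two steps easy to inspect separately.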

For databases built with --window-size, the relative values are not based on the maximum number of possible shared k-mers but on the actual number of unique minimizers extracted from the read.

A different cutoff can be set for every database in a multiple or hierarchical database classification. A different filter can be set for every level of a hierarchical database classification.

Note that reads that remain with only one reference match (after cutoff and filter are applied) are considered a unique match.

ganon build-custom

Target file or sequence (--input-target)

Customized builds can be done either by file or sequence. --input-target file will consider every file provided with --input a single unit. --input-target sequence will use every sequence as a unit.

--input-target file is the default behavior and most efficient way to build databases. --input-target sequence should only be used when the input sequences are stored in a single file or when classification at sequence level is desired.

Build level (--level)

The --level parameter defines the max. depth of the database for classification. This parameter is relevant because the --max-fp is going to be guaranteed at the --level chosen. By default, the level will be the same as --input-target, meaning that classification will be done either at file or sequence level.

Alternatively, --level assembly will link the file or sequence target information with assembly accessions retrieved from NCBI servers. --level leaves or --level species (or genus, family, ...) will link the targets with taxonomic information and prune the tree at the chosen level. --level custom will use the specialization level defined in the --input-file.

Retrieving info (--ncbi-sequence-info, --ncbi-file-info)

Further taxonomy and assembly linking information has to be collected to properly build the database. --ncbi-sequence-info and --ncbi-file-info allow customizations on this step.

--ncbi-sequence-info (used when --input-target sequence) allows the use of NCBI e-utils webservices or downloads accession2taxid files to extract target information. By default uses e-utils up-to 50000 sequences or downloads nucl_gb nucl_wgs from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/ otherwise. Previously downloaded files can be directly provided.
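To illustrate what is extracted from accession2taxid files, the sketch below looks up taxids for a set of sequence accessions. It assumes the usual NCBI layout of these files (tab-separated, with a header line naming the columns accession, accession.version, taxid and gi). The accessions and values in the sample are made up, and the function is not part of ganon.

```python
import csv
import io


def load_taxids(handle, wanted):
    """Map accession.version -> taxid for the accessions listed in `wanted`."""
    found = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        accession = row["accession.version"]
        if accession in wanted:
            found[accession] = row["taxid"]
    return found


# Made-up sample in the assumed accession2taxid layout.
example = (
    "accession\taccession.version\ttaxid\tgi\n"
    "XX000001\tXX000001.1\t562\t0\n"
    "XX000002\tXX000002.1\t623\t0\n"
    "XX000003\tXX000003.2\t562\t0\n"
)
taxids = load_taxids(io.StringIO(example), {"XX000001.1", "XX000003.2"})
print(taxids)
```

Real accession2taxid files are large and gzip-compressed. In that case the handle could be opened with `gzip.open(path, "rt")` so the file is streamed line by line instead of being loaded into memory.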

--ncbi-file-info (used when --input-target file) downloads assembly_summary.txt files to extract target information from https://ftp.ncbi.nlm.nih.gov/genomes/. Previously downloaded files can be directly provided.

Parameters

usage: ganon [-h] [-v] {build,build-custom,update,classify,report,table} ...

- - - - - - - - - -
   _  _  _  _  _   
  (_|(_|| |(_)| |  
   _|   v. 1.2.0
- - - - - - - - - -

positional arguments:
  {build,build-custom,update,classify,report,table}
    build               Download and build ganon default databases
                        (refseq/genbank)
    build-custom        Build custom ganon databases
    update              Update ganon default databases
    classify            Classify reads against built databases
    report              Generate reports from classification results
    table               Generate table from reports

options:
  -h, --help            show this help message and exit
  -v, --version         Show program's version number and exit.

ganon build

usage: ganon build [-h] [-g [...]] [-a [...]] [-b [...]] [-o] [-c] [-u] [-m [...]] -d DB_PREFIX [-x] [-t] [-p] [-f] [-k]
                   [-w] [-s] [--restart] [--verbose] [--quiet]

options:
  -h, --help            show this help message and exit

required arguments:
  -g [ ...], --organism-group [ ...]
                        One or more organism groups to download [archaea,bacteria,fungi,human,invertebrate,metagenomes,o
                        ther,plant,protozoa,vertebrate_mammalian,vertebrate_other,viral]. Mutually exclusive --taxid
                        (default: None)
  -a [ ...], --taxid [ ...]
                        One or more taxonomic identifiers to download. e.g. 562 (-x ncbi) or 's__Escherichia coli' (-x
                        gtdb). Mutually exclusive --organism-group (default: None)
  -d DB_PREFIX, --db-prefix DB_PREFIX
                        Database output prefix (default: None)

download arguments:
  -b [ ...], --source [ ...]
                        Source to download [refseq,genbank] (default: ['refseq'])
  -o , --top            Download limited assemblies for each taxa. 0 for all. (default: 0)
  -c, --complete-genomes
                        Download only sub-set of complete genomes (default: False)
  -u , --genome-updater 
                        Additional genome_updater parameters (https://github.com/pirovc/genome_updater) (default: None)
  -m [ ...], --taxonomy-files [ ...]
                        Specific files for taxonomy - otherwise files will be downloaded (default: None)

important arguments:
  -x , --taxonomy       Set taxonomy to enable taxonomic classification, lca and reports [ncbi,gtdb,skip] (default:
                        ncbi)
  -t , --threads 

advanced arguments:
  -p , --max-fp         Max. false positive rate for bloom filters Mutually exclusive --filter-size. (default: 0.05)
  -f , --filter-size    Fixed size for filter in Megabytes (MB). Mutually exclusive --max-fp. (default: 0)
  -k , --kmer-size      The k-mer size to split sequences. (default: 19)
  -w , --window-size    The window-size to build filter with minimizers. (default: 32)
  -s , --hash-functions 
                        The number of hash functions for the interleaved bloom filter [0-5]. 0 to detect optimal value.
                        (default: 0)

optional arguments:
  --restart             Restart build/update from scratch, do not try to resume from the latest possible step.
                        {db_prefix}_files/ will be deleted if present. (default: False)
  --verbose             Verbose output mode (default: False)
  --quiet               Quiet output mode (default: False)

ganon build-custom

usage: ganon build-custom [-h] [-i [...]] [-e] [-n] [-a] [-l] [-m [...]] [--write-info-file] [-r [...]] [-q [...]] -d
                          DB_PREFIX [-x] [-t] [-p] [-f] [-k] [-w] [-s] [--restart] [--verbose] [--quiet]

options:
  -h, --help            show this help message and exit

required arguments:
  -i [ ...], --input [ ...]
                        Input file(s) and/or folder(s). Mutually exclusive --input-file. (default: None)
  -e , --input-extension 
                        Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *).
                        (default: fna.gz)
  -d DB_PREFIX, --db-prefix DB_PREFIX
                        Database output prefix (default: None)

custom arguments:
  -n , --input-file     Manually set information for input files: file <tab> [target <tab> node <tab> specialization
                        <tab> specialization name]. target is the sequence identifier if --input-target sequence (file
                        can be repeated for multiple sequences). if --input-target file and target is not set, filename
                        is used. node is the taxonomic identifier. Mutually exclusive --input (default: None)
  -a , --input-target   Target to use [file, sequence]. By default: 'file' if multiple input files are provided or
                        --input-file is set, 'sequence' if a single file is provided. Using 'file' is recommended and
                        will speed-up the building process (default: None)
  -l , --level          Use a specialized target to build the database. By default, --level is the --input-target.
                        Options: any available taxonomic rank [species, genus, ...] or 'leaves' (requires --taxonomy).
                        Further specialization options [assembly,custom]. assembly will retrieve and use the assembly
                        accession and name. custom requires and uses the specialization field in the --input-file.
                        (default: None)
  -m [ ...], --taxonomy-files [ ...]
                        Specific files for taxonomy - otherwise files will be downloaded (default: None)
  --write-info-file     Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for
                        further attempts. (default: False)

ncbi arguments:
  -r [ ...], --ncbi-sequence-info [ ...]
                        Uses NCBI e-utils webservices or downloads accession2taxid files to extract target information.
                        [eutils,nucl_gb,nucl_wgs,nucl_est,nucl_gss,pdb,prot,dead_nucl,dead_wgs,dead_prot or one or more
                        accession2taxid files from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/]. By
                        default uses e-utils up-to 50000 sequences or downloads nucl_gb nucl_wgs otherwise. (default:
                        [])
  -q [ ...], --ncbi-file-info [ ...]
                        Downloads assembly_summary files to extract target information.
                        [refseq,genbank,refseq_historical,genbank_historical or one or more assembly_summary files from
                        https://ftp.ncbi.nlm.nih.gov/genomes/] (default: ['refseq', 'genbank'])

important arguments:
  -x , --taxonomy       Set taxonomy to enable taxonomic classification, lca and reports [ncbi,gtdb,skip] (default:
                        ncbi)
  -t , --threads 

advanced arguments:
  -p , --max-fp         Max. false positive rate for bloom filters Mutually exclusive --filter-size. (default: 0.05)
  -f , --filter-size    Fixed size for filter in Megabytes (MB). Mutually exclusive --max-fp. (default: 0)
  -k , --kmer-size      The k-mer size to split sequences. (default: 19)
  -w , --window-size    The window-size to build filter with minimizers. (default: 32)
  -s , --hash-functions 
                        The number of hash functions for the interleaved bloom filter [0-5]. 0 to detect optimal value.
                        (default: 0)

optional arguments:
  --restart             Restart build/update from scratch, do not try to resume from the latest possible step.
                        {db_prefix}_files/ will be deleted if present. (default: False)
  --verbose             Verbose output mode (default: False)
  --quiet               Quiet output mode (default: False)
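
The interplay between --max-fp, --filter-size and --hash-functions can be illustrated with the textbook Bloom filter estimate. This is only a rough sketch: it assumes a single classic Bloom filter, not the Interleaved Bloom Filter ganon actually builds, and the function names are hypothetical, so the numbers will not match ganon's own sizing.

```python
import math


def bloom_fp_rate(n_elements: int, filter_bits: int, n_hashes: int) -> float:
    """Classic Bloom filter false positive estimate: (1 - e^(-h*n/m))^h."""
    return (1.0 - math.exp(-n_hashes * n_elements / filter_bits)) ** n_hashes


def bits_for_max_fp(n_elements: int, max_fp: float, n_hashes: int) -> int:
    """Filter size in bits needed to stay at or below max_fp (fixed hash count)."""
    # Invert the estimate above: m = -h*n / ln(1 - p^(1/h))
    root = max_fp ** (1.0 / n_hashes)
    return math.ceil(-n_hashes * n_elements / math.log(1.0 - root))


def best_hash_count(n_elements: int, filter_bits: int, max_hashes: int = 5) -> int:
    """Hash count in 1..max_hashes with the lowest estimated false positive rate."""
    return min(
        range(1, max_hashes + 1),
        key=lambda h: bloom_fp_rate(n_elements, filter_bits, h),
    )


if __name__ == "__main__":
    n = 1_000_000  # hypothetical number of minimizers stored in one filter
    bits = bits_for_max_fp(n, max_fp=0.05, n_hashes=3)
    print("bits needed:", bits)
    print("best hash count for that size:", best_hash_count(n, bits))
```

The sketch shows why the two options exclude each other: fixing the false positive rate determines the size, and fixing the size determines the rate.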

ganon update

usage: ganon update [-h] -d DB_PREFIX [-o] [-t] [--restart] [--verbose] [--quiet]

options:
  -h, --help            show this help message and exit

required arguments:
  -d DB_PREFIX, --db-prefix DB_PREFIX
                        Existing database input prefix (default: None)

important arguments:
  -o , --output-db-prefix 
                        Output database prefix. By default will be the same as --db-prefix and overwrite files (default:
                        None)
  -t , --threads 

optional arguments:
  --restart             Restart build/update from scratch, do not try to resume from the latest possible step.
                        {db_prefix}_files/ will be deleted if present. (default: False)
  --verbose             Verbose output mode (default: False)
  --quiet               Quiet output mode (default: False)

ganon classify

usage: ganon classify [-h] -d [DB_PREFIX ...] [-s [reads.fq[.gz] ...]] [-p [reads.1.fq[.gz] reads.2.fq[.gz] ...]]
                      [-c [...]] [-e [...]] [-o] [--output-lca] [--output-all] [--output-unclassified] [--output-single]
                      [-t] [-l [...]] [-r [...]] [--verbose] [--quiet]

options:
  -h, --help            show this help message and exit

required arguments:
  -d [DB_PREFIX ...], --db-prefix [DB_PREFIX ...]
                        Database input prefix[es] (default: None)
  -s [reads.fq[.gz] ...], --single-reads [reads.fq[.gz] ...]
                        Multi-fastq[.gz] file[s] to classify (default: None)
  -p [reads.1.fq[.gz] reads.2.fq[.gz] ...], --paired-reads [reads.1.fq[.gz] reads.2.fq[.gz] ...]
                        Multi-fastq[.gz] pairs of file[s] to classify (default: None)

cutoff/filter arguments:
  -c [ ...], --rel-cutoff [ ...]
                        Min. percentage of a read (set of minimizers) shared with a reference necessary to consider
                        a match. Generally used to cutoff low similarity matches. Single value or one per database (e.g.
                        0.7 1 0.25). 0 for no cutoff (default: [0.2])
  -e [ ...], --rel-filter [ ...]
                        Additional relative percentage of minimizers (relative to the best match) to keep a match.
                        Generally used to select best matches above cutoff. Single value or one per hierarchy (e.g. 0.1
                        0). 1 for no filter (default: [0.1])

output arguments:
  -o , --output-prefix 
                        Output prefix for output (.rep) and report (.tre). Empty to output to STDOUT (only .rep)
                        (default: None)
  --output-lca          Output an additional file with one lca match for each read (.lca) (default: False)
  --output-all          Output an additional file with all matches. File can be very large (.all) (default: False)
  --output-unclassified
                        Output an additional file with unclassified read headers (.unc) (default: False)
  --output-single       When using multiple hierarchical levels, output everything in one file instead of one per
                        hierarchy (default: False)

other arguments:
  -t , --threads        Number of sub-processes/threads to use (default: 1)
  -l [ ...], --hierarchy-labels [ ...]
                        Hierarchy definition of --db-prefix files to be classified. Can also be a string, but input will
                        be sorted to define order (e.g. 1 1 2 3). The default value reported without hierarchy is 'H1'
                        (default: None)
  -r [ ...], --ranks [ ...]
                        Ranks to report (.tre). 'all' for all possible ranks. empty for default ranks (superkingdom
                        phylum class order family genus species assembly). This file can be re-generated with the 'ganon
                        report' command (default: None)
  --verbose             Verbose output mode (default: False)
  --quiet               Quiet output mode (default: False)
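
As a rough illustration of how --rel-cutoff and --rel-filter combine, the sketch below applies the cutoff first and then the filter relative to the best remaining match. The function name, the input layout and the exact threshold arithmetic (no rounding, filter threshold taken as best * (1 - rel_filter)) are assumptions made for illustration; ganon's internal computation may differ.

```python
def select_matches(shared, n_minimizers, rel_cutoff=0.2, rel_filter=0.1):
    """Return the matches of one read that survive cutoff and filter.

    shared:       {reference: number of minimizers shared with the read}
    n_minimizers: number of minimizers in the read
    """
    # Step 1: cutoff. Drop references sharing too small a fraction of the read.
    # rel_cutoff = 0 keeps everything.
    kept = {ref: c for ref, c in shared.items() if c >= rel_cutoff * n_minimizers}
    if not kept:
        return {}

    # Step 2: filter. Keep only matches close enough to the best one.
    # rel_filter = 1 lowers the threshold to 0, i.e. no filtering.
    best = max(kept.values())
    min_keep = best * (1.0 - rel_filter)
    return {ref: c for ref, c in kept.items() if c >= min_keep}


if __name__ == "__main__":
    # Hypothetical read with 100 minimizers and five candidate references.
    shared = {"ref1": 90, "ref2": 85, "ref3": 60, "ref4": 30, "ref5": 10}
    print(select_matches(shared, 100))
    print(select_matches(shared, 100, rel_filter=1))
```

Under these assumptions the cutoff removes weak matches in absolute terms, while the filter narrows what is left to the references nearly as good as the best one.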

ganon report

usage: ganon report [-h] -i [...] [-e INPUT_EXTENSION] -o OUTPUT_PREFIX [-d [...]] [-x] [-m [...]] [-f] [-t] [-r [...]]
                    [-s] [-a] [-y] [-p [...]] [-k [...]] [--verbose] [--quiet] [--min-count] [--max-count]
                    [--names [...]] [--names-with [...]] [--taxids [...]]

options:
  -h, --help            show this help message and exit

required arguments:
  -i [ ...], --input [ ...]
                        Input file(s) and/or folder(s). '.rep' file(s) from ganon classify. (default: None)
  -e INPUT_EXTENSION, --input-extension INPUT_EXTENSION
                        Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *).
                        (default: rep)
  -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX
                        Output prefix for report file 'output_prefix.tre'. In case of multiple files, the base input
                        filename will be appended at the end of the output file 'output_prefix + FILENAME.tre' (default:
                        None)

db/tax arguments:
  -d [ ...], --db-prefix [ ...]
                        Database prefix(es) used for classification. Only '.tax' file(s) are required. If not provided,
                        new taxonomy will be downloaded. Mutually exclusive with --taxonomy. (default: [])
  -x , --taxonomy       Taxonomy database to use [ncbi,gtdb,skip]. Mutually exclusive with --db-prefix. (default: ncbi)
  -m [ ...], --taxonomy-files [ ...]
                        Specific files for taxonomy - otherwise files will be downloaded (default: None)

output arguments:
  -f , --output-format 
                        Output format [text, tsv, csv]. text outputs a tabulated formatted text file for better
                        visualization. Default: tsv (default: tsv)
  -t , --report-type    Type of report to generate [reads, matches]. Default: reads (default: reads)
  -r [ ...], --ranks [ ...]
                        Ranks to report ['', 'all', custom list] 'all' for all possible ranks. empty for default ranks
                        (superkingdom phylum class order family genus species assembly). Default: (default: [])
  -s , --sort           Sort report by [rank, lineage, count, unique]. Default: rank (with custom --ranks) or lineage
                        (with --ranks all) (default: )
  -a, --no-orphan       Omit orphan nodes from the final report. Otherwise, orphan nodes (= nodes not found in the
                        db/tax) are reported as 'na' with root as direct parent (default: False)
  -y, --split-hierarchy
                        Split output reports by hierarchy (from ganon classify --hierarchy-labels). If activated, the
                        output files will be named as '{output_prefix}.{hierarchy}.tre' (default: False)
  -p [ ...], --skip-hierarchy [ ...]
                        One or more hierarchies to skip in the report (from ganon classify --hierarchy-labels) (default:
                        [])
  -k [ ...], --keep-hierarchy [ ...]
                        One or more hierarchies to keep in the report (from ganon classify --hierarchy-labels) (default:
                        [])

optional arguments:
  --verbose             Verbose output mode (default: False)
  --quiet               Quiet output mode (default: False)

filter arguments:
  --min-count           Minimum number/percentage of counts to keep a taxon [values between 0-1 for percentage, >1
                        specific number] (default: 0)
  --max-count           Maximum number/percentage of counts to keep a taxon [values between 0-1 for percentage, >1
                        specific number] (default: 0)
  --names [ ...]        Show only entries matching exact names of the provided list (default: [])
  --names-with [ ...]   Show entries containing full or partial names of the provided list (default: [])
  --taxids [ ...]       One or more taxids to report (including children taxa) (default: [])
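
The dual meaning of --min-count and --max-count (fraction below 1, absolute number above) can be sketched as follows. The helper names are hypothetical, the total is assumed to be the sum of the counts passed in, and treating a value of exactly 1 as an absolute count is also an assumption; the help text does not specify either point.

```python
def to_absolute(threshold: float, total: int) -> float:
    """Read values below 1 as a fraction of the total, others as a plain count."""
    return threshold * total if threshold < 1 else threshold


def filter_by_count(counts, min_count=0, max_count=0):
    """Keep taxa whose count lies within the requested bounds.

    counts: {taxon: count}. A bound of 0 (the default) disables that bound.
    """
    total = sum(counts.values())
    kept = {}
    for taxon, count in counts.items():
        if min_count and count < to_absolute(min_count, total):
            continue  # too rare
        if max_count and count > to_absolute(max_count, total):
            continue  # too abundant
        kept[taxon] = count
    return kept


if __name__ == "__main__":
    # Hypothetical counts for four taxa (1000 in total).
    counts = {"A": 500, "B": 9, "C": 50, "D": 441}
    print(filter_by_count(counts, min_count=0.01))  # fraction of the total
    print(filter_by_count(counts, min_count=100))   # absolute count
```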

ganon table

usage: ganon table [-h] -i [...] [-e] -o OUTPUT_FILE [-l] [-f] [-t] [-a] [-m] [-r] [-n] [--header]
                   [--unclassified-label] [--filtered-label] [--skip-zeros] [--transpose] [--verbose] [--quiet]
                   [--min-count] [--max-count] [--names [...]] [--names-with [...]] [--taxids [...]]

options:
  -h, --help            show this help message and exit

required arguments:
  -i [ ...], --input [ ...]
                        Input file(s) and/or folder(s). '.tre' file(s) from ganon report. (default: None)
  -e , --input-extension 
                        Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *).
                        (default: tre)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        Output filename for the table (default: None)

output arguments:
  -l , --output-value   Output value on the table [percentage, counts]. percentage values are reported between [0-1].
                        Default: counts (default: counts)
  -f , --output-format 
                        Output format [tsv, csv]. Default: tsv (default: tsv)
  -t , --top-sample     Top hits of each sample individually (default: 0)
  -a , --top-all        Top hits of all samples (ranked by percentage) (default: 0)
  -m , --min-frequency 
                        Minimum number/percentage of files containing a taxon to keep it [values between 0-1 for
                        percentage, >1 specific number] (default: 0)
  -r , --rank           Define specific rank to report. Empty will report all ranks. (default: None)
  -n, --no-root         Do not report root node entry and lineage. Direct and shared matches to root will be accounted
                        as unclassified (default: False)
  --header              Header information [name, taxid, lineage]. Default: name (default: name)
  --unclassified-label 
                        Add column with unclassified count/percentage with the chosen label. May be the same as
                        --filtered-label (e.g. unassigned) (default: None)
  --filtered-label      Add column with filtered count/percentage with the chosen label. May be the same as
                        --unclassified-label (e.g. unassigned) (default: None)
  --skip-zeros          Do not print lines with only zero count/percentage (default: False)
  --transpose           Transpose output table (taxa as cols and files as rows) (default: False)

optional arguments:
  --verbose             Verbose output mode (default: False)
  --quiet               Quiet output mode (default: False)

filter arguments:
  --min-count           Minimum number/percentage of counts to keep a taxon [values between 0-1 for percentage, >1
                        specific number] (default: 0)
  --max-count           Maximum number/percentage of counts to keep a taxon [values between 0-1 for percentage, >1
                        specific number] (default: 0)
  --names [ ...]        Show only entries matching exact names of the provided list (default: [])
  --names-with [ ...]   Show entries containing full or partial names of the provided list (default: [])
  --taxids [ ...]       One or more taxids to report (including children taxa) (default: [])
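
One way to picture --top-sample together with --output-value is the sketch below: the top taxa are picked per sample, their union defines the table entries, and values are emitted as counts or as fractions in [0-1]. The function names and data layout are hypothetical, and computing percentages against each sample's own total is an assumption; ganon table may select and normalize differently.

```python
def top_taxa_per_sample(samples, top):
    """Union of the `top` most abundant taxa of each sample (0 keeps all taxa)."""
    selected = set()
    for counts in samples.values():
        ranked = sorted(counts, key=counts.get, reverse=True)
        selected.update(ranked[:top] if top else ranked)
    return selected


def build_table(samples, top_sample=0, output_value="counts"):
    """Return {taxon: {sample: value}} for the selected taxa.

    samples: {sample name: {taxon: count}}
    """
    taxa = sorted(top_taxa_per_sample(samples, top_sample))
    table = {}
    for taxon in taxa:
        row = {}
        for name, counts in samples.items():
            value = counts.get(taxon, 0)  # taxa missing from a sample count as 0
            if output_value == "percentage":
                value = value / sum(counts.values())
            row[name] = value
        table[taxon] = row
    return table


if __name__ == "__main__":
    # Two hypothetical samples with per-taxon counts.
    samples = {
        "s1": {"A": 70, "B": 20, "C": 10},
        "s2": {"A": 5, "C": 80, "D": 15},
    }
    print(build_table(samples, top_sample=1))
    print(build_table(samples, top_sample=1, output_value="percentage"))
```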
