Barque v1.7.4

Environmental DNA metabarcoding analysis

Developed by Eric Normandeau in Louis Bernatchez's laboratory.

Please see the licence information at the end of this file.

Description

Barque is a fast eDNA metabarcoding analysis pipeline that annotates reads, instead of Operational Taxonomic Units (OTUs), using high-quality barcoding databases.

Barque can also produce OTUs, which are then annotated using a database. These annotated OTUs are then used as a database themselves to find read counts per OTU per sample, effectively "annotating" the reads with the OTUs that were previously found.

Citation

Barque is described as an accurate and efficient eDNA analysis pipeline in:

Mathon L, Guérin P-E, Normandeau E, Valentini A, Noel C, Lionnet C, Linard B, Thuiller W, Bernatchez L, Mouillot D, Dejean T, Manel S. 2021. Benchmarking bioinformatic tools for fast and accurate eDNA metabarcoding species identification. Molecular Ecology Resources.

https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13430

Use cases

The approach implemented in Barque is especially useful for species management projects:

  • Monitoring invasive species
  • Confirming the presence of specific species
  • Characterizing meta-communities in varied environments
  • Improving species distribution knowledge of cryptic taxa
  • Following loss of species over medium to long-term monitoring

Since Barque depends on high-quality barcoding databases, it is especially useful for COI amplicons used in combination with the Barcode of Life Database (BOLD) or for 12S amplicons with the mitofish database. It can also use any other database, for example the Silva database for the 18S gene, or a custom database. If species annotations are not possible, Barque can be used in OTU mode.

Installation

To use Barque, you will need a local copy of its repository. Different releases can be found here. It is recommended to always use the latest release or even the development version. You can either download an archive of the latest release at the above link or get the latest commit (recommended) with the following git command:

git clone https://github.com/enormandeau/barque

Dependencies

To run Barque, you will also need to have the following programs installed on your computer.

  • Barque will only work on GNU Linux or OSX
  • bash 4+
  • python 3.5+ (you can use miniconda3 to install python)
  • R 3+ (ubuntu/mint: sudo apt-get install r-base-core)
  • java (ubuntu/mint: sudo apt-get install default-jre)
  • gnu parallel
  • flash (read merger) v1.2.11+
  • vsearch v2.14.2+
    • /!\ v2.14.2+ required /!\
    • Barque will not work with older versions of vsearch

Preparation

  • Install dependencies
  • Download a copy of the Barque repository (see Installation above)
  • Edit 02_info/primers.csv to provide information describing your primers
  • Get or prepare the database(s) (see the Formatting database section below), deposit the fasta.gz file in the 03_databases folder, and give it a name that matches the information in the 02_info/primers.csv file.
  • Make a copy of 02_info/barque_config.sh, modify the parameters for your run
  • Launch Barque, for example with ./barque 02_info/barque_config.sh

Overview of Barque steps

During the analyses, the following steps are performed:

  • Filter and trim raw reads (trimmomatic)
  • Merge paired-end reads (flash)
  • Split merged reads by amplicon (Python script)
  • Look for chimeras (optional, vsearch)
  • Merge unique reads (Python script)
  • Find species associated with each unique read (vsearch)
  • Summarize results (Python script)
    • Tables of phylum, genus, and species counts per sample, including multiple hits
    • Number of retained reads per sample at each analysis step with figure
    • Most frequent non-annotated sequences to blast on NCBI nt/nr
    • Species counts for these non-annotated sequences
    • Sequence groups for cases of multiple hits
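
To illustrate what the "Merge unique reads" step means conceptually, here is a minimal sketch that collapses identical sequences and counts them, so that each distinct sequence only needs to be searched against the database once. This is not Barque's own script. The function name and the choice to ignore letter case are assumptions made for the example.

```python
from collections import Counter


def merge_unique_reads(sequences):
    """Collapse identical sequences.

    Returns a list of (sequence, count) pairs, most abundant first.
    Sequences are compared in upper case (an assumption of this sketch).
    """
    counts = Counter(seq.upper() for seq in sequences)
    return counts.most_common()
```

The counts can later be carried over to whichever species each unique sequence is assigned to.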

Running the pipeline

For each new project, get a new copy of Barque from the source listed in the Installation section. In this case, you do not need to modify the primer and config files.

Running on the test dataset

If you want to test Barque, jump straight to the Test dataset section at the end of this file. Read through the README afterwards to better understand the program and its outputs.

Preparing samples

Copy your paired-end sample files into the 04_data folder. You need one pair of files per sample. The sequences in these files must contain the sequences of the primers that you used during the PCR. Depending on the format in which you received your sequences from the sequencing facility, you may have to demultiplex them before you can use Barque.

IMPORTANT: The file names must follow this format:

SampleID_*_R1_001.fastq.gz
SampleID_*_R2_001.fastq.gz

Notes: Each sample name, or SampleID, must contain no underscore (_) and be followed by an underscore (_). The star (*) can be any string of text that does not contain space characters. For example, you can use dashes (-) to separate parts of your sample names, e.g.: PopA-sample001_ANYTHING_R1_001.fastq.gz.
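
The naming rules above can be checked before a run with a small script. The following is a minimal sketch, not part of Barque. The regular expression is one reading of the rules, and the function name is hypothetical. It additionally assumes that the SampleID itself contains no whitespace.

```python
import re

# SampleID (no underscore, no whitespace), an underscore, any text without
# whitespace, then _R1_001.fastq.gz or _R2_001.fastq.gz
FILENAME_PATTERN = re.compile(
    r"(?P<sample>[^_\s]+)_(?P<middle>\S*)_R(?P<read>[12])_001\.fastq\.gz"
)


def parse_sample_filename(filename):
    """Return (sample_id, read_number) if the name follows the format, else None."""
    match = FILENAME_PATTERN.fullmatch(filename)
    if match is None:
        return None
    return match.group("sample"), int(match.group("read"))
```

Running this over the names in 04_data and confirming that every SampleID appears once with read 1 and once with read 2 is a quick way to catch missing or misnamed files.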

Formatting database

You need to put a database in gzip-compressed Fasta format, or .fasta.gz, in the 03_databases folder.

An augmented version of the mitofish 12S database is already available in Barque.

The pre-formatted BOLD database can be downloaded here.

If you want to use a newer version of the BOLD database, you will need to download all the animal BINs from this page. Put the downloaded Fasta files in 03_databases/bold_bins (you will need to create that folder), and run the following commands to format the BOLD database:

# Format each BIN individually (~10 minutes)
# Note: the `species_to_remove.txt` file is optional
ls -1 03_databases/bold_bins/*.fas.gz |
    parallel ./01_scripts/util/format_bold_database.py \
    {} {.}_prepared.fasta.gz species_to_remove.txt

# Concatenate the resulting formatted bins into one file (~10 seconds)
gunzip -c 03_databases/bold_bins/*_prepared.fasta.gz > 03_databases/bold.fasta

# Compress the result, since the database must be in .fasta.gz format
gzip 03_databases/bold.fasta

  • For other databases, get the database and format it:
    • gzip-compressed Fasta format (.fasta.gz)
    • Name lines contain three pieces of information separated by underscores (_)
    • Ex: >Phylum_Genus_species
    • Ex: >Family_Genus_species
    • Ex: >Mammal_rattus_norvegicus
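
When preparing a custom database, it can be useful to check the name lines before running the pipeline. The following is a minimal sketch, not part of Barque. Both function names are hypothetical, and the check assumes that a valid name line has exactly three non-empty fields separated by underscores, which is one reading of the format described above.

```python
import gzip


def is_valid_name_line(line):
    """Return True if a Fasta name line looks like >Group_Genus_species."""
    if not line.startswith(">"):
        return False
    fields = line[1:].strip().split("_")
    return len(fields) == 3 and all(fields)


def find_invalid_name_lines(fasta_gz_path):
    """Return the name lines of a .fasta.gz file that fail the check above."""
    invalid = []
    with gzip.open(fasta_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith(">") and not is_valid_name_line(line):
                invalid.append(line.rstrip("\n"))
    return invalid
```

For example, find_invalid_name_lines("03_databases/my_database.fasta.gz"), where the file name is a placeholder, returns an empty list when every name line has three fields.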

Configuration file

Make a copy of the file named 02_info/barque_config.sh and modify the parameters as needed.

Launching Barque

Launch the barque executable with the name of your configuration file as an argument, like this:

./barque 02_info/MY_CONFIG_FILE.sh

Results

Once the pipeline has finished running, all result files are found in the 12_results folder.

After a run, it is recommended to make a copy of this folder and name it with the current date, e.g.:

cp -r 12_results 12_results_PROJECT_NAME_2020-07-27_SOME_ADDITIONAL_INFO

Taxa count tables, named after the primer names

  • PRIMER_genus_table.csv
  • PRIMER_phylum_table.csv
  • PRIMER_species_table.csv

Sequence dropout report and figure

  • sequence_dropout.csv: Lists how many sequences were present in each sample at every analysis step. Depending on library and sequencing quality, as well as the biological diversity found at the sample site, more or fewer sequences are lost at each of the analysis steps. The figure sequence_dropout_figure.png shows how many sequences are retained for each sample at each step of the pipeline.

Most frequent non-annotated sequences

  • most_frequent_non_annotated_sequences.fasta: Sequences that are frequent in the samples but were not annotated by the pipeline. This Fasta file should be used to query the NCBI nt/nr database using the online portal found here to see what species may have been missed. Use blastn with default parameters. Once the NCBI blastn search is finished, download the results as a text file and use the command given in the Summarize species found in non-annotated sequences section below (you will need to adjust the input and output file names) to generate a report of the most frequently found species in the non-annotated sequences.

Fasta files with sequences from multiple hit groups

  • 12_results/01_multihits contains Fasta files with database and sample sequences to help understand why some of the sequences cannot be unambiguously assigned to one species. For example, two different species can sometimes have identical sequences in the database. At other times, sample sequences can be at the same distance from the sequences of two species in the database.
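
The idea of a multiple hit can be illustrated with a short sketch: a read is ambiguous when two or more species tie for the best match. This is not Barque's implementation. The function name and the (species, percent identity) input format are assumptions made for the example.

```python
def best_hit_species(hits):
    """Return the sorted species names tied for the highest identity.

    `hits` is a list of (species, percent_identity) pairs for one read.
    More than one name in the result means the read is a multiple hit.
    """
    if not hits:
        return []
    best = max(identity for _, identity in hits)
    return sorted({species for species, identity in hits if identity == best})
```

A result with a single name corresponds to an unambiguous assignment. A result with several names corresponds to the situations described above.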

Summarize species found in non-annotated sequences

./01_scripts/10_report_species_for_non_annotated_sequences.py \
    12_results/NCBI-Alignment.txt \
    12_results/most_frequent_non_annotated_sequences_species_ncbi.csv 97 |
    sort -u -k 2,3 | cut -c 2- | perl -pe 's/ /\t/' > missing_species_97_percent.txt

The first result file will contain one line per identified taxon and the number of sequences for each taxon, sorted in decreasing order. For any species of interest found in this file, it is a good idea to download the representative sequences from NCBI, add them to the database, and rerun the analysis.

You can modify the percentage value, here 97. The missing_species_97_percent.txt file will list the sequence identifiers from NCBI so that you can download them from the online database and add them to your own database as needed.

One way to do this automatically is to make a file with only the first column, that is: one NCBI sequence identifier per line, and load it on this page:

https://www.ncbi.nlm.nih.gov/sites/batchentrez

You will need to rename the sequences to follow the database name format described in the Formatting database section and add them to your current database.
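
Making the one-identifier-per-line file for Batch Entrez can be scripted. The following is a minimal sketch, not part of Barque. It assumes that the first tab-separated column of missing_species_97_percent.txt holds the NCBI sequence identifier, and the function name is hypothetical.

```python
def extract_first_column(lines):
    """Return the unique first tab-separated fields, in order of first appearance."""
    seen = set()
    identifiers = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        identifier = line.split("\t")[0]
        if identifier not in seen:
            seen.add(identifier)
            identifiers.append(identifier)
    return identifiers
```

To use it, open missing_species_97_percent.txt, pass the file object to extract_first_column, and write the returned identifiers, one per line, to a new file that you then load on the Batch Entrez page.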

Log files and parameters

For each Barque run, three files are written in the 99_logfiles folder. Each contains a timestamp with the time of the run:

  1. The exact barque config file that has been used
  2. The exact primer file as it was used
  3. The full log of the run

Lather, Rinse, Repeat

Once the pipeline has been run, it is normal to find that unexpected species have been found or that a proportion of the reads have not been identified, either because the sequenced species are absent from the database or because the sequences have the exact same distance from two or more sequences in the database. In these cases, you will need to remove unwanted species from the database or download additional sequences for the non-annotated species from NCBI to add them to it. Once the database has been improved, simply run the pipeline again with this new database. You can put SKIP_DATA_PREP=1 in your config file if you wish to avoid repeating the initial data preparation steps of Barque. You may need to repeat this procedure until you are satisfied with the completeness of the results.

NOTE: You should provide justifications in your publications if you decide to remove some species from the results or database.

Test dataset

A test dataset is available as a sister repository on GitHub. It is composed of 10 mitofish-12S metabarcoding samples, each with 10,000 forward and 10,000 reverse sequences.

Download the repository and then move the data from barque_test_dataset/04_data to Barque's 04_data folder.

If you have git and Barque's dependencies installed, the following commands will download the Barque repository and the test data and put them in the appropriate folder.

git clone https://github.com/enormandeau/barque
git clone https://github.com/enormandeau/barque_test_dataset
cp barque_test_dataset/04_data/* barque/04_data/

To run the analysis, move to the barque folder and launch:

cd barque
./barque 02_info/barque_config.sh

The analysis of this test dataset takes 25 seconds on a Linux ThinkPad laptop from 2012 running with 4 core-i7 CPUs and 70 seconds on the same laptop using only one CPU.

License

CC share-alike

Barque by Eric Normandeau is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://github.com/enormandeau/barque.
