natir / Yacrd

Licence: mit
Yet Another Chimeric Read Detector

Programming Languages

rust

Yet Another Chimeric Read Detector for long reads

Using all-against-all read mapping, yacrd performs:

  1. computation of pile-up coverage for each read
  2. detection of chimeras
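As an illustration of the first step, here is a minimal Python sketch of pile-up coverage computed from PAF records. It is not yacrd's actual implementation (yacrd is written in Rust). The function name `pileup_coverage` is hypothetical, and two choices are assumptions: each overlap is counted on both the query and the target read, and self-overlaps are skipped. PAF coordinates are 0-based with an open end.

```python
from collections import defaultdict


def pileup_coverage(paf_lines):
    """Return {read_name: [depth at each base]} from PAF records."""
    lengths = {}
    # read name -> position -> change in depth at that position
    deltas = defaultdict(lambda: defaultdict(int))

    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete PAF record
        qname, qlen, qstart, qend = fields[0], int(fields[1]), int(fields[2]), int(fields[3])
        tname, tlen, tstart, tend = fields[5], int(fields[6]), int(fields[7]), int(fields[8])
        if qname == tname:
            continue  # assumption: a read overlapping itself adds no coverage

        # assumption: an overlap covers both reads involved
        for name, length, start, end in ((qname, qlen, qstart, qend),
                                         (tname, tlen, tstart, tend)):
            lengths[name] = length
            deltas[name][start] += 1
            deltas[name][end] -= 1

    coverage = {}
    for name, length in lengths.items():
        depth = 0
        per_base = []
        for pos in range(length):
            depth += deltas[name].get(pos, 0)
            per_base.append(depth)
        coverage[name] = per_base
    return coverage


# One overlap between read "a" (bases 0-4 of 10) and read "b" (bases 2-8 of 8).
example = ["a\t10\t0\t4\t+\tb\t8\t2\t8\t0\t0\t0"]
coverage = pileup_coverage(example)
```

A per-base list keeps the sketch easy to follow; for real data sets an interval-based representation would use far less memory.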

Chimera detection is done as follows:

  1. for each region where coverage is less than or equal to min_coverage (default 0), yacrd creates a bad region
  2. if a bad region starts strictly after the beginning of the read and ends strictly before the end of the read, the read is marked as Chimeric
  3. if the total bad region length > 0.8 * read length, the read is marked as NotCovered
  4. a read that is neither Chimeric nor NotCovered is marked as NotBad

Rationale

Long read error-correction tools usually detect and also remove chimeras. But it is difficult to isolate or retrieve information from just this step.

DAStrim (from the DASCRUBBER suite) does a similar job to yacrd, but relies on a different mapping step and uses different (likely more advanced) heuristics. Yacrd is simpler and easier to use.

This repository contains a set of scripts to evaluate yacrd against other similar tools such as DASCRUBBER and miniscrub on real data sets.

Input

Any set of long reads (PacBio, Nanopore, anything that can be given to minimap2). yacrd takes as input either the PAF (Pairwise mApping Format) file produced by minimap2 or a BLASR m4 file produced by another long-read overlapper.

Requirements

  • Rust in stable channel
  • libgz
  • libbzip2
  • liblzma

Installation

With conda

yacrd is available in the bioconda channel.

If the bioconda channel is set up, you can run:

conda install yacrd

From source

git clone https://github.com/natir/yacrd.git
cd yacrd
git checkout v0.6.2

cargo build
cargo test
cargo install --path .

How to use Yacrd

Find chimera

minimap2 reads.fq reads.fq > overlap.paf
yacrd -i overlap.paf -o reads.yacrd

Post-detection operation

yacrd can perform some post-detection operations:

  • filter: for a sequence or overlap file, records involving reads marked as Chimeric or NotCovered are not written to the output
  • extract: for a sequence or overlap file, only records involving reads marked as Chimeric or NotCovered are written to the output
  • split: for a sequence file, bad regions in the middle of reads are removed, and NotCovered reads are removed
  • scrubb: for a sequence file, all bad regions are removed, and NotCovered reads are removed
minimap2 reads.fq reads.fq > mapping.paf
yacrd -i mapping.paf -o reads.yacrd filter -i reads.fasta -o reads.filter.fasta
yacrd -i mapping.paf -o reads.yacrd extract -i reads.fasta -o reads.extract.fasta
yacrd -i mapping.paf -o reads.yacrd split -i reads.fasta -o reads.split.fasta
yacrd -i mapping.paf -o reads.yacrd scrubb -i reads.fasta -o reads.scrubb.fasta

Recommended overlapping parameters for read scrubbing

For nanopore data, we recommend using minimap2 with the all-vs-all nanopore preset and the maximal distance between seeds set to 500 (option -g 500) to generate overlaps. We recommend running yacrd with the minimal coverage set to 4 (option -c 4) and the minimal coverage of a read set to 0.4 (option -n 0.4).

This is an example of how to run a yacrd scrubbing:

minimap2 -x ava-ont -g 500 reads.fasta reads.fasta > overlap.paf
yacrd -i overlap.paf -o report.yacrd -c 4 -n 0.4 scrubb -i reads.fasta -o reads.scrubb.fasta

For PacBio P6-C4 data, we recommend using minimap2 with the all-vs-all PacBio preset and the maximal distance between seeds set to 800 (option -g 800) to generate overlaps. We recommend running yacrd with the minimal coverage set to 4 (option -c 4) and the minimal coverage of a read set to 0.4 (option -n 0.4).

minimap2 -x ava-pb -g 800 reads.fasta reads.fasta > overlap.paf
yacrd -i overlap.paf -o report.yacrd -c 4 -n 0.4 scrubb -i reads.fasta -o reads.scrubb.fasta

For PacBio Sequel data, we recommend using minimap2 with the all-vs-all PacBio preset and the maximal distance between seeds set to 5000 (option -g 5000) to generate overlaps. We recommend running yacrd with the minimal coverage set to 3 (option -c 3) and the minimal coverage of a read set to 0.4 (option -n 0.4).

minimap2 -x ava-pb -g 5000 reads.fasta reads.fasta > overlap.paf
yacrd -i overlap.paf -o report.yacrd -c 3 -n 0.4 scrubb -i reads.fasta -o reads.scrubb.fasta

Important note

Extension

yacrd uses the file extension to detect the file format. If your filename contains (anywhere):

  • .paf: the file is treated as a minimap file
  • .m4, .mhap: the file is treated as a blasr m4 file (mhap output)
  • .fa, .fasta: the file is treated as a fasta file
  • .fq, .fastq: the file is treated as a fastq file
  • .yacrd: the file is treated as a yacrd output file
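Because the match is made anywhere in the filename, a name such as overlap.paf.gz is still recognised. The Python sketch below illustrates this kind of substring matching. It is not yacrd's implementation: `detect_format` and its return values are hypothetical, and the order in which rules are tried is an assumption.

```python
def detect_format(filename):
    """Guess a file format from markers found anywhere in the filename."""
    # The fastq rule is tried before the fasta rule because ".fa" is a
    # prefix of ".fastq": testing ".fa" first would mislabel fastq files.
    rules = [
        ((".paf",), "paf"),
        ((".m4", ".mhap"), "m4"),
        ((".fq", ".fastq"), "fastq"),
        ((".fa", ".fasta"), "fasta"),
        ((".yacrd",), "yacrd"),
    ]
    for markers, file_format in rules:
        if any(marker in filename for marker in markers):
            return file_format
    return None  # no known marker found


detected = detect_format("overlap.paf.gz")
```

A filename that contains several markers (for example reads.fasta.paf) is ambiguous under this scheme, so it is safer to keep a single format marker in each filename.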

Compression

yacrd automatically detects whether a file is compressed (gzip, bzip2 and lzma compression are supported). For post-detection operations, if the input is compressed, the output uses the same compression format.
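The text does not say how the detection is done. One common approach, shown here only as an assumption about how such detection can work, is to inspect the first bytes of the file for a known signature. In this Python sketch, `detect_compression` is a hypothetical name and the lzma entry matches the xz container signature.

```python
import bz2
import gzip
import lzma

# Leading byte signatures of each supported compression format.
MAGIC_NUMBERS = [
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\xfd\x37\x7a\x58\x5a\x00", "lzma"),  # xz container signature
]


def detect_compression(header):
    """Return the compression name for the given leading bytes, or None."""
    for magic, name in MAGIC_NUMBERS:
        if header.startswith(magic):
            return name
    return None  # treated as uncompressed


# Compress a small FASTA record in memory and check each signature.
record = b">readA\nACGT\n"
detected = {
    "gzip": detect_compression(gzip.compress(record)),
    "bzip2": detect_compression(bz2.compress(record)),
    "lzma": detect_compression(lzma.compress(record)),
    "plain": detect_compression(record),
}
```

Checking content rather than the filename means a compressed file is handled correctly even when its name carries no .gz, .bz2 or .xz suffix.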

Use yacrd report as input

You can use a yacrd report as input in place of an overlap file. The ondisk options are ignored if you use a yacrd report as input.

Output

type_of_read    id_in_mapping_file  length_of_read  length_of_gap,begin_pos_of_gap,end_pos_of_gap;length_of_gap,be…

Example

NotCovered readA 4599    3782,0,3782

Here, readA doesn't have sufficient coverage: there is a zero-coverage region of length 3782bp between positions 0 and 3782.

Chimeric    readB   10452   862,1260,2122;3209,4319,7528

Here, readB is chimeric with 2 zero-coverage regions: one between bases 1260 and 2122, another between 4319 and 7528.
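For downstream scripts, a report line in this layout can be parsed with a few lines of Python. This is a minimal sketch: the names `Gap`, `Record` and `parse_report_line` are hypothetical, and it assumes fields are separated by whitespace and that the gap field may be absent for reads without bad regions.

```python
from typing import List, NamedTuple


class Gap(NamedTuple):
    length: int
    begin: int
    end: int


class Record(NamedTuple):
    read_type: str
    read_id: str
    read_length: int
    gaps: List[Gap]


def parse_report_line(line):
    """Parse one yacrd report line into a Record."""
    fields = line.split()
    read_type, read_id, read_length = fields[0], fields[1], int(fields[2])
    gaps = []
    if len(fields) > 3:  # assumption: the gap field can be missing
        for chunk in fields[3].split(";"):
            if chunk:
                length, begin, end = (int(value) for value in chunk.split(","))
                gaps.append(Gap(length, begin, end))
    return Record(read_type, read_id, read_length, gaps)


record = parse_report_line("Chimeric    readB   10452   862,1260,2122;3209,4319,7528")
```

In both examples above, the gap length equals the end position minus the begin position, which gives a quick sanity check on parsed records.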

Citation

If you use yacrd in your research, please cite the following publication:

Pierre Marijon, Rayan Chikhi, Jean-Stéphane Varré, yacrd and fpa: upstream tools for long-read genome assembly, Bioinformatics, btaa262, https://doi.org/10.1093/bioinformatics/btaa262

bibtex format:

@article{Marijon_2020,
	doi = {10.1093/bioinformatics/btaa262},
	url = {https://doi.org/10.1093%2Fbioinformatics%2Fbtaa262},
	year = 2020,
	month = {apr},
	publisher = {Oxford University Press ({OUP})},
	author = {Pierre Marijon and Rayan Chikhi and Jean-St{\'{e}}phane Varr{\'{e}}},
	editor = {Inanc Birol},
	title = {yacrd and fpa: upstream tools for long-read genome assembly},
	journal = {Bioinformatics}
}