All Projects → brentp → genoiser

brentp / genoiser

Licence: MIT license
use the noise

Programming Languages

nim
578 projects
shell
77523 projects

Projects that are alternatives of or similar to genoiser

indelope
find large indels (in the blind spot between GATK/freebayes and SV callers)
Stars: ✭ 38 (+153.33%)
Mutual labels:  genomics, nim-lang
psmc
Implementation of the Pairwise Sequentially Markovian Coalescent (PSMC) model
Stars: ✭ 121 (+706.67%)
Mutual labels:  genomics
bfc
High-performance error correction for Illumina resequencing data
Stars: ✭ 66 (+340%)
Mutual labels:  genomics
snp-sites
Finds SNP sites from a multi-FASTA alignment file
Stars: ✭ 182 (+1113.33%)
Mutual labels:  genomics
PHIST
Phage-Host Interaction Search Tool
Stars: ✭ 19 (+26.67%)
Mutual labels:  genomics
bigly
a pileup library that embraces the huge
Stars: ✭ 38 (+153.33%)
Mutual labels:  genomics
ezancestry
Easy genetic ancestry predictions in Python
Stars: ✭ 38 (+153.33%)
Mutual labels:  genomics
DriverPower
DriverPower
Stars: ✭ 22 (+46.67%)
Mutual labels:  genomics
DRAM
Distilled and Refined Annotation of Metabolism: A tool for the annotation and curation of function for microbial and viral genomes
Stars: ✭ 159 (+960%)
Mutual labels:  genomics
bac-genomics-scripts
Collection of scripts for bacterial genomics
Stars: ✭ 39 (+160%)
Mutual labels:  genomics
Unchained
A fully type safe, compile time only units library.
Stars: ✭ 70 (+366.67%)
Mutual labels:  nim-lang
biopython-coronavirus
Biopython Jupyter Notebook tutorial to characterize a small genome
Stars: ✭ 80 (+433.33%)
Mutual labels:  genomics
mity
mity: A highly sensitive mitochondrial variant analysis pipeline for whole genome sequencing data
Stars: ✭ 27 (+80%)
Mutual labels:  genomics
simuG
simuG: a general-purpose genome simulator
Stars: ✭ 68 (+353.33%)
Mutual labels:  genomics
geoguessrnim
GeoGuessr browser plugin, hide Ads, Filters for StreetView and Mapillary for Chromium and Firefox
Stars: ✭ 17 (+13.33%)
Mutual labels:  nim-lang
genipe
Genome-wide imputation pipeline
Stars: ✭ 28 (+86.67%)
Mutual labels:  genomics
sample
Performs memory-efficient reservoir sampling on very large input files delimited by newlines
Stars: ✭ 61 (+306.67%)
Mutual labels:  genomics
nimSocks
A filtering SOCKS proxy server and client library written in nim.
Stars: ✭ 51 (+240%)
Mutual labels:  nim-lang
genepattern-server
The GenePattern Server web application
Stars: ✭ 26 (+73.33%)
Mutual labels:  genomics
gubbins
Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins
Stars: ✭ 103 (+586.67%)
Mutual labels:  genomics

genoiser: the noise is the signal

given a lot of alignment files, genoiser helps to find the areas of the genome that have noise signals that result in bad variant (small, SV, ME) calls. It finds per-sample noise and then aggregates across samples and reports the number of samples at each base in the genome that had a given noise signal. These regions can be used as black-list regions instead of or in addition to LCRs.

mosdepth uses chromosome-sized arrays of int32's to track sequencing depth. This is fast and flexible.

given mosdepth as a special-case for depth, genoiser is a general case for user-defined functions. mosdepth could be implemented with genoiser.

An added benefit is the reduction of memory; mosdepth allocates an int32 array the size of each chromosome--meaning about 1GB of memory for chromosome 1. genoiser can use smaller-sized chunks to tile across each chromosome. This is important because it uses 1 array for each user-defined function. It defaults to 8 megabase chunks as that is the smallest size with no noticeable effect on performance in our tests. Chunk sizes down to 100KB have some, but minor effect on performance.

The idea is of genoiser is that it will handle all accounting, a user simply defines a nim function that takes an alignment and then indicates which genomic positions to increment. For example, to calculate depth, this user function would increment from start to end:

proc depthfun*(aln:Record, posns:var seq[mrange]) =
  ## depthfun is an example of a `fun` that can be sent to `genoiser`.
  ## it increments from aln.start to aln.stop of passing reads.
  var f = aln.flag
  if f.unmapped or f.secondary or f.qcfail or f.dup: return
  posns.add((aln.start, aln.stop, 1))

The posns value is sent to the function by genoiser and the user-defined function can add to it as many elements as desired. In this case it increments from aln.start to aln.stop by 1. It can inrement by any integer value.

The user could also choose to increment any soft or hard-clip location:

proc softfun*(aln:Record, posns:var seq[mrange]) =
  ## softfun an example of a `fun` that can be sent to `genoiser`.
  ## it sets positions where there are soft-clips
  var f = aln.flag
  if f.unmapped or f.secondary or f.supplementary or f.qcfail or f.dup: return
  var cig = aln.cigar
  if cig.len == 1: return
  var pos = aln.start

  for op in cig:
    if op.op == CigarOp.soft_clip or op.op == CigarOp.hard_clip:
      # for this function, we want the exact break-points, not the span of the event,
      # so we increment the position and the one that follows it.
      posns.add((pos, pos+1, 1))
    if op.consumes.reference:
      pos += op.len

Utility

This library provides the machinery. Other command-line tools will use this for more obviously useful things.

Speed

for maximum speed, compile with nim c -d:release --passC:-flto --passL:-s --gc:markAndSweep src/genoiser.nim

CLI

The command-line interface allows running pre-specified noise filters in 2 steps. The first steps calculates the "noise" in each sample.

genoiser per-sample --fasta $reference results/$sample /path/to/$sample.bam # or cram

This will use a single thread and it will take about 1 hour for 30X bams and a bit more for crams. It will output 5 files per sample. at the specificied prefix, in this case, results/$sample

After all samples have been run, then the user can aggregate the signal across samples. The recommended commands are:

chroms="$(seq 1 22) X Y"

########
## soft
########
# count number of samples at each site where more than 1 and less than 15% of reads were soft-clipped.
echo $chroms | tr ' ' '\n' \
    | gargs -d -v -p 5 "genoiser aggregate -t 10 -e '(value / depth) < 0.15 & (value > 1)' results/*.genoiser.{}.soft.bed > soft.{}.bed"

# count number of samples at each site where more than 2 where soft-clipped.
echo $chroms | tr ' ' '\n' \
    | gargs -d -v -p 5 "genoiser aggregate -t 10 -e '(value > 2)' results/*.genoiser.{}.soft.bed > high-soft.{}.bed"


###########
## mq0
###########

# count number of samples at each site where more than 2 reads had MQ0.
echo $chroms | tr ' ' '\n' \
    | gargs -d -v -p 5 "genoiser aggregate -t 10 -e '(value > 2)' results/*.genoiser.{}.mq0.bed > mq0.{}.bed"


############
## weird
############
# count number of samples at each site where more than 1 and less than 15% of reads were weird.
echo $chroms | tr ' ' '\n' \
    | gargs -d -v -p 5 "genoiser aggregate -t 10 -e '(value / depth) < 0.15 & (value > 1)' results/*.genoiser.{}.weird.bed > weird.{}.bed"

echo $chroms | tr ' ' '\n' \
    | gargs -d -v -p 5 "genoiser aggregate -t 10 -e '(value > 2)' results/*.genoiser.{}.weird.bed > high-weird.{}.bed"

############
## mismatches
############

# count number of samples at each site where 10 or more reads read with 4 or more mismatches overlapped.
echo $chroms | tr ' ' '\n' \
    | gargs -d -v -p 5 "genoiser aggregate -t 10 -e 'value > 10' results/*.genoiser.{}.mismatches.bed > mismatches.{}.bed"


##################
# interchromosomal
##################

# count number of samples at each site where more than 1 and less than 15% of reads were interchromosomal.
echo $chroms | tr ' ' '\n' \
    | gargs -d -v -p 5 "genoiser aggregate -t 10 -e '(value / depth) < 0.15 & (value > 1)' results/*.genoiser.{}.interchromosomal.bed > interchromosomal.{}.bed"

echo $chroms | tr ' ' '\n' \
    | gargs -d -v -p 5 "genoiser aggregate -t 10 -e '(value > 2)' results/*.genoiser.{}.interchromosomal.bed > high-interchromosomal.{}.bed"

where gargs is available as a static binary from here

This will output 1 file per chromosome, per metric. The resulting files are the desired output.

Note that this will use 5*10 threads. Adjust this to match your CPU requirements.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].