
brentp / Smoove

Licence: apache-2.0

structural variant calling and genotyping with existing tools, but, smoothly.



smoove

smoove simplifies and speeds calling and genotyping SVs for short reads. It also improves specificity by removing many spurious alignment signals that are indicative of low-level noise and often contribute to spurious calls.

There is a blog-post describing smoove in more detail here

It supports both small cohorts in a single command and population-level calling in 4 steps, 2 of which are parallel by sample.

There is a table on the precision and recall of smoove and duphold (which is used by smoove) here

It requires:

lumpy and lumpy_filter
samtools: for CRAM support
gsort: to sort final VCF
bgzip+tabix: to compress and index final VCF

And optionally (but all highly recommended):

svtyper: to genotype SVs
svtools: required for large cohorts
mosdepth: to remove high-coverage regions.
bcftools: version 1.5 or higher for VCF indexing and filtering.
duphold: to annotate depth changes within events and at the break-points.

Running smoove without any arguments will show which of these are found so they can be added to the PATH as needed.
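The same check can be sketched by hand in the shell, which is handy in a setup script before smoove itself is installed. This is a minimal sketch, not what smoove does internally: the tool names come from the lists above, and missing_tools is a hypothetical helper name.

```shell
# Tools named in the requirement lists above.
REQUIRED="lumpy lumpy_filter samtools gsort bgzip tabix"
OPTIONAL="svtyper svtools mosdepth bcftools duphold"

# Print each of the given tool names that is not found on PATH.
# (missing_tools is a hypothetical helper, not part of smoove.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Word splitting of the two lists is intentional here.
echo "missing required: $(missing_tools $REQUIRED | tr '\n' ' ')"
echo "missing optional: $(missing_tools $OPTIONAL | tr '\n' ' ')"
```

An empty list after "missing required:" means every required tool was found on the current PATH.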

smoove will:

parallelize calls to lumpy_filter to extract split and discordant reads required by lumpy
further filter lumpy_filter calls to remove high-coverage, spurious regions and user-specified chroms like 'hs37d5'; it will also remove reads that we've found are likely spurious signals. After this, it will remove singleton reads (where the mate was removed by one of the previous filters) from the discordant BAMs. This makes lumpy much faster and less memory-hungry.
calculate per-sample metrics for mean, standard deviation, and distribution of insert size as required by lumpy.
stream output of lumpy directly into multiple svtyper processes for parallel-by-region genotyping while lumpy is still running.
sort, compress, and index final VCF.

installation

you can get smoove and all dependencies via (a large) docker image:

docker pull brentp/smoove
docker run -it brentp/smoove smoove -h

Or, you can download a smoove binary from here: https://github.com/brentp/smoove/releases. When run without any arguments, smoove will show you which of its dependencies it can find so you can adjust your $PATH and install accordingly.

usage

small cohorts (n < ~ 40)

for small cohorts it's possible to get a jointly-called, genotyped VCF in a single command.

smoove call -x --name my-cohort --exclude $bed --fasta $reference_fasta -p $threads --genotype /path/to/*.bam

output will go to ./my-cohort-smoove.genotyped.vcf.gz

the --exclude $bed is highly recommended as it can be used to ignore reads that overlap problematic regions.

A good set of regions for GRCh37 is here.

And for hg38 here
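Because a missing exclude BED or reference is easy to overlook, the small-cohort command can be wrapped in a quick input check. The sketch below is illustrative only: the default file names are placeholders, inputs_ok is a hypothetical helper, and the smoove command is the one shown above.

```shell
# Placeholder paths (assumptions); set these to your own files.
bed=${bed:-exclude.bed}
reference_fasta=${reference_fasta:-reference.fa}
threads=${threads:-4}

# Return 0 only if every given file exists and is non-empty.
# (inputs_ok is a hypothetical helper, not part of smoove.)
inputs_ok() {
    for f in "$@"; do
        [ -s "$f" ] || { echo "missing or empty: $f" >&2; return 1; }
    done
}

# Run the single-command cohort call only when the inputs and smoove are present.
if inputs_ok "$bed" "$reference_fasta" && command -v smoove >/dev/null 2>&1; then
    smoove call -x --name my-cohort --exclude "$bed" --fasta "$reference_fasta" \
        -p "$threads" --genotype /path/to/*.bam
fi
```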

population calling

For population-level calling (large cohorts) the steps are:

For each sample, call genotypes:

smoove call --outdir results-smoove/ --exclude $bed --name $sample --fasta $reference_fasta -p 1 --genotype /path/to/$sample.bam

For large cohorts, it's better to parallelize across samples rather than using a large $threads per sample. smoove can only parallelize up to 2 or 3 threads on a single-sample and it's most efficient to use 1 thread.

output will go to results-smoove/$sample-smoove.genotyped.vcf.gz
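One way to parallelize across samples on a single machine is xargs -P, running one single-threaded smoove call per alignment file. This is a sketch under assumptions: call_all is a hypothetical helper, the SMOOVE variable is an invented override used only to make dry runs possible, and the sample name is assumed to be the file name without its .bam suffix. On a cluster, a scheduler array job would fill the same role.

```shell
# Run `smoove call` once per alignment file, NJOBS at a time, 1 thread each.
# Usage: call_all NJOBS BED FASTA file1.bam [file2.bam ...]
# (call_all and the SMOOVE override are hypothetical, not part of smoove.)
call_all() {
    njobs=$1; bed=$2; fasta=$3
    shift 3
    printf '%s\n' "$@" |
        xargs -P "$njobs" -I{} sh -c '
            # $1 = alignment file, $2 = smoove executable, $3 = BED, $4 = FASTA
            sample=$(basename "$1" .bam)
            "$2" call --outdir results-smoove/ --exclude "$3" \
                --name "$sample" --fasta "$4" -p 1 --genotype "$1"
        ' _ {} "${SMOOVE:-smoove}" "$bed" "$fasta"
}

# Example (not run here): 8 samples at a time.
# call_all 8 "$bed" "$reference_fasta" /path/to/*.bam
```

Setting SMOOVE=echo prints the commands instead of running them, which is a cheap way to check the derived sample names first.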

Get the union of sites across all samples (this can be parallelized across as many CPUs or machines as needed):

# this will create ./merged.sites.vcf.gz
smoove merge --name merged -f $reference_fasta --outdir ./ results-smoove/*.genotyped.vcf.gz

Genotype each sample at those sites (this can be parallelized across as many CPUs or machines as needed) and run duphold to add depth annotations.

smoove genotype -d -x -p 1 --name $sample-joint --outdir results-genotyped/ --fasta $reference_fasta --vcf merged.sites.vcf.gz /path/to/$sample.bam

Paste all the single-sample VCFs, which must have the same number of variants, to get a single, squared, joint-called file.

smoove paste --name $cohort results-genotyped/*.vcf.gz
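Since paste expects every per-sample VCF to have the same number of variants, it can help to confirm that before pasting. The following is a rough sketch with hypothetical helper names (count_records, same_record_count). It counts non-header lines only and does not check that the sites themselves match.

```shell
# Print the number of non-header records in a gzip- or bgzip-compressed VCF.
count_records() {
    gzip -dc "$1" | awk '!/^#/ { n++ } END { print n + 0 }'
}

# Return 0 if all given VCFs have the same record count; otherwise report
# the first mismatch on stderr and return 1.
same_record_count() {
    expected=""
    for vcf in "$@"; do
        n=$(count_records "$vcf")
        if [ -z "$expected" ]; then
            expected=$n
        elif [ "$n" != "$expected" ]; then
            echo "$vcf has $n records, expected $expected" >&2
            return 1
        fi
    done
}

# Example (not run here):
# same_record_count results-genotyped/*.vcf.gz &&
#     smoove paste --name "$cohort" results-genotyped/*.vcf.gz
```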

(optional) annotate the variants with overlapping exons and UTRs from a GFF, and annotate high-quality heterozygotes:

smoove annotate --gff Homo_sapiens.GRCh37.82.gff3.gz $cohort.smoove.square.vcf.gz | bgzip -c > $cohort.smoove.square.anno.vcf.gz

This adds an SHQ (Smoove Het Quality) tag to every sample's FORMAT field. A value of 4 is a high-quality call and a value of 1 is low quality; -1 is non-het. It also adds MSHQ (Mean SHQ) to the INFO field, which is the mean SHQ score across all heterozygous samples for that variant.

As a first pass, users can look for variants with MSHQ > 3. If you added duphold annotations, it's also useful to check deletions with DHFFC < 0.7 and duplications with DHFFC > 1.25.
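These thresholds can be written as bcftools filter expressions. The sketch below assumes the annotated file carries INFO/MSHQ, INFO/SVTYPE and FORMAT/DHFFC; the variable and function names are hypothetical. A FORMAT tag without a sample index matches when any sample passes, so multi-sample files may need a stricter expression.

```shell
# Filter expressions built from the thresholds mentioned above.
HIGH_QUALITY_HETS='INFO/MSHQ>3'
DEPTH_SUPPORTED='(INFO/SVTYPE="DEL" & FMT/DHFFC<0.7) | (INFO/SVTYPE="DUP" & FMT/DHFFC>1.25)'

# Usage: high_quality_hets in.vcf.gz > out.vcf
# Keeps variants whose mean het quality exceeds 3.
high_quality_hets() {
    bcftools view -i "$HIGH_QUALITY_HETS" "$1"
}

# Usage: depth_supported in.vcf.gz > out.vcf
# Keeps deletions and duplications whose duphold fold-change supports the call.
depth_supported() {
    bcftools view -i "$DEPTH_SUPPORTED" "$1"
}

# Example (not run here):
# high_quality_hets "$cohort.smoove.square.anno.vcf.gz" | bgzip -c > first-pass.vcf.gz
```

The two filters are kept separate because the depth check only applies to deletions and duplications; combining them would drop other SV types.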

Troubleshooting

A panic with a message like Segmentation fault (core dumped) | bcftools view -O z -c 1 -o most likely means you have an old version of bcftools. See #10
smoove will write to the system TMPDIR. For large cohorts, make sure to set this to something with a lot of space. e.g. export TMPDIR=/path/to/big
smoove requires a recent version of lumpy and lumpy_filter, so build those from source or get the most recent bioconda version.
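To rule out the old-bcftools case up front, the installed version can be compared against the 1.5 minimum listed in the requirements. This is a sketch only: version_ge is a hypothetical helper that compares major and minor numbers, and it assumes the first line of bcftools --version has the form "bcftools X.Y".

```shell
# Return 0 if dotted version $1 is >= $2, comparing major then minor
# numerically (so 1.10 counts as newer than 1.5). Missing parts count as 0.
version_ge() {
    awk -v have="$1" -v need="$2" 'BEGIN {
        split(have, h, "."); split(need, n, ".")
        for (i = 1; i <= 2; i++) {
            if (h[i] + 0 > n[i] + 0) exit 0
            if (h[i] + 0 < n[i] + 0) exit 1
        }
        exit 0
    }'
}

if command -v bcftools >/dev/null 2>&1; then
    # Assumes the first line of the version output is "bcftools X.Y".
    have=$(bcftools --version | awk 'NR == 1 { print $2 }')
    version_ge "$have" 1.5 || echo "bcftools $have is older than 1.5" >&2
fi
```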
