Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

Created with love in Canada, visit hostnodejs.com today

Feel like to post an Ad? Learn Details

All Projects → brentp → Somalier

brentp / Somalier

Licence: mit

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"

Programming Languages

578 projects

Labels

bioinformatics genomics cancer-genomics

Projects that are alternatives of or similar to Somalier

Personalized Genomics and Proteomics. Main diet: Ensembl, side dishes: SNPs

Stars: ✭ 261 (+103.91%)

Mutual labels: bioinformatics, genomics, cancer-genomics

Genomics Extension for SQLite

Stars: ✭ 90 (-29.69%)

Mutual labels: bioinformatics, genomics

Awesome 10x Genomics

List of tools and resources related to the 10x Genomics GEMCode/Chromium system

Stars: ✭ 82 (-35.94%)

Mutual labels: bioinformatics, genomics

HiCExplorer is a powerful and easy to use set of tools to process, normalize and visualize Hi-C data.

Stars: ✭ 116 (-9.37%)

Mutual labels: bioinformatics, genomics

A tool to circularize genome assemblies

Stars: ✭ 121 (-5.47%)

Mutual labels: bioinformatics, genomics

A fast whole-genome aligner based on de Bruijn graphs

Stars: ✭ 76 (-40.62%)

Mutual labels: bioinformatics, genomics

Gcp For Bioinformatics

GCP Essentials for Bioinformatics Researchers

Stars: ✭ 95 (-25.78%)

Mutual labels: bioinformatics, genomics

Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins

Stars: ✭ 67 (-47.66%)

Mutual labels: bioinformatics, genomics

Inference of ploidy and heterozygosity structure using whole genome sequencing data

Stars: ✭ 98 (-23.44%)

Mutual labels: bioinformatics, genomics

Code examples of fast and simple k-mer counters for tutorial purposes

Stars: ✭ 124 (-3.12%)

Mutual labels: bioinformatics, genomics

NGLess: NGS with less work

Stars: ✭ 115 (-10.16%)

Mutual labels: bioinformatics, genomics

A cool place to store your Hi-C

Stars: ✭ 112 (-12.5%)

Mutual labels: bioinformatics, genomics

An interactive web tool for quality control of DNA sequencing data

Stars: ✭ 76 (-40.62%)

Mutual labels: bioinformatics, genomics

Bayesian genotyper for structural variants

Stars: ✭ 79 (-38.28%)

Mutual labels: bioinformatics, genomics

Flexible genotype query among 30,000+ samples whole-genome

Stars: ✭ 72 (-43.75%)

Mutual labels: bioinformatics, genomics

Bioinformatics library for .NET

Stars: ✭ 90 (-29.69%)

Mutual labels: bioinformatics, genomics

A C/C++ library for fast interval overlap queries (with a "bedtools coverage" example)

Stars: ✭ 111 (-13.28%)

Mutual labels: bioinformatics, genomics

Distilled and Refined Annotation of Metabolism: A tool for the annotation and curation of function for microbial and viral genomes

Stars: ✭ 47 (-63.28%)

Mutual labels: bioinformatics, genomics

Model and predict short DNA sequence features with neural networks

Stars: ✭ 59 (-53.91%)

Mutual labels: bioinformatics, genomics

Antimicrobial Resistance Identification By Assembly

Stars: ✭ 96 (-25%)

Mutual labels: bioinformatics, genomics

View All Similar Projects ➔

somalier: extract informative sites, evaluate relatedness, and perform quality-control on BAM/CRAM/BCF/VCF/GVCF

Existing software for checking relatedness requires jointly-called germ-line variant calls, but cancer projects have only somatic calls between tumor-normal pairs.

somalier makes checking any number of samples for identity easy directly from the alignments:

The first step is to extract sites. For VCF just use:

somalier extract -d extracted/ --sites sites.vcf.gz -f /data/human/g1k_v37_decoy.fa $cohort.vcf.gz

with a sites file from releases

For BAM or CRAM, use: This is parallelizable by sample via cluster or cloud, but here, using a for loop:

for f in *.cram; do
    somalier extract -d extracted/ --sites sites.vcf.gz -f /data/human/g1k_v37_decoy.fa $f
done

--sites is a VCF of known polymorphic sites in VCF format. A good set is provided in the releases but any set of common variants will work.

The next step is to calculate relatedness on the extracted data:

somalier relate --ped $pedigree extracted/*.somalier

This will create text and interactive HTML output (similar to peddy) that makes it fast and easy to detect mismatched samples and sample-swaps.

Example output is here

Note that the somalier relate command runs extremely quickly (< 2 seconds for 600 samples and ~1 minute for 4,500 samples) so it's possible to add/remove samples or adjust a pedigree file and re-run iteratively.

For example to add the n + 1th samples, just run somalier extract on the new sample and then re-use the already extracted data from the n original samples.

For huge sample-sets, if you run into a bash error for argument list too long, you can pass the somalier files as quoted glob strings like: "/path/to/set-a/*.somalier" "/path/to/set-b/*.somalier".

Example Output

Interactive output from somalier relate is here
Interactive output from somalier ancestry is here

Usage

The usage is also described above. Briefly, after downloading the somalier binary and a sites vcf from the releases run:

somalier extract -d cohort/ --sites sites.hg38.vcf.gz -f $reference $sample.bam

for each sample to create a small binary file of the ref and alt counts for the variants listed in sites.hg38.vcf.gz.

for a vcf, run:

somalier extract -d cohort/ --sites sites.hg38.vcf.gz -f $reference $cohort.bcf

Then run:

somalier relate --ped $pedigree_file cohort/*.somalier

This will create an html file for QC in a few seconds.

Note that if a new sample is added to the cohort, it's only necessary to perform the extract step on that sample and then run the (fast) relate step again with all of the extracted files.

VCF

somalier can extract from a multi or single-sample VCF or a GVCF. This will be much faster, in cases where it's available, this would look like:

somalier extract -d extracted/ --sites sites.vcf.gz -f /data/human/g1k_v37_decoy.fa joint.vcf.gz

following this, there will be a $sample.somalier file for each sample in the joint.vcf.gz

Note that somalier uses the AD field to extract depth information. If that FORMAT field is not present in the header, then it will use the genotypes only and use a total depth of 20 (10,10 for heterozygote), etc.

Install

get a static binary from here

Users can also get a docker image here which contains htslib and a somalier binary ready-for-use.

How it works

somalier takes a list of known polymorphic sites. Even a few hundred (or dozen) sites can be a very good indicator of relatedness. The best sites are those with a population allele frequency close to 0.5 as that maximizes the probability that any 2 samples will differ. A list of such sites is provided in the releases for GRCh37 and hg38.

In order to quickly calculate genotypes at these sites, somalier assays the exact base. The extraction step is done directly from the bam/cram files 1 sample at a time.

The relate step is run on the output of the extract commands. It runs extremely quickly so that new samples can be added and compared. It uses 3 bit-vectors per sample for hom-ref, het, hom-alt. Each bitvector is a sequence of 64 bit integers where each bit is set if the variant at that index in the sample is for example, heterozygous. With this setup, we can use fast bitwise operations and popcount hardware instructions to calculate relatedness extremely quickly.

For each sample-pair, it reports:

IBS0 -- the number of sites where one sample is hom-ref and another is hom-alt
IBS2 -- the number of sites where the samples have the same genotype
shared-hets -- the number of sites where both samples are heterozygotes
shared-hom-alts -- the number of sites where both samples are homozygous alternate

These are used to calculate relatedness and a measure of relatedness that is unaffected by loss-of-heterozygosity that is often seen in some cancers. The interactive output allows toggling between any of these measures.

It also reports depth information and the count of HET, HOM_REF, HOM_ALT, and unknown genotypes for each sample along with a number of metrics that are useful for general QC.

Example

Here, each point is a pair of samples. We can see that the expected identical sample-pairs (e.g. tumor-normal pairs) specified by the user and drawn in red mostly cluster together on the right. Unrelateds cluster on the lower left. The sample-swaps are the blue points that cluster with the red. In the somalier output, the user can hover to see which sample-pairs are involved each point

Ancestry Estimate

note: this feature is working, but still experimental. it may change in future versions.

somalier can predict ancestry on a set of query samples given a set of labelled samples, for example from thousand genomes along with labels for. This would look like:

somalier ancestry --labels ancestry-labels-1kg.tsv 1kg-somalier/*.somalier ++ query-samples-somalier/*.somalier

Where the ++ separates the labeled samples from the query samples. This command will create an html output along with a text file of the predictions.

ancestry-labels-1kg.tsv is here

and the somalier files for thousand genomes can be downloaded from here These were created from the thousand genomes high coverage data from here

Note that these will work for either GRCh37 or hg38 as long as you use the most recent sites files distributed with somalier.

Example output is here

Usage

Usage is intentionally very simple and running somalier extract or somalier relate will give sufficient help for nearly all cases.

By default somalier will only consider variants that have a "PASS" or "RefCall" FILTER. To extend this list, set the environment variable SOMALIER_ALLOWED_FILTERS to a comma-delimited list of additional filters to allow.

by default sites with an allele balance < 0.01 will be considered homozygous reference. To adjust this, use e.g. : SOMALIER_AB_HOM_CUTOFF=0.04 somalier relate ...

Other Work

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5499645/

https://academic.oup.com/bioinformatics/article/33/4/596/2624551

Acknowledgement

This work was motivated by interaction and discussions with Preeti Aahir and several early users who provided valuable feedback.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].

Stars: ✭ 128

Visit Git Page 🔗Visit User Page 🔗Visit Issues Page (22) 🔗