fritzsedlazeck / SVCollector

Licence: MIT License
Method to optimally select samples for validation and resequencing

SVCollector: Optimized sample selection for cost-efficient long-read population sequencing

Structural variations (SVs) are increasingly recognized for their importance in genomics. Short-read sequencing is the most widely used approach for genotyping large numbers of samples for SVs, but it suffers from relatively poor accuracy. Here we present SVCollector, an open-source method that optimally selects samples to maximize variant discovery and validation using long-read resequencing or PCR-based validation. SVCollector has two selection strategies: choosing the samples that are individually the most diverse, or choosing those that collectively capture the largest number of variations.

If you experience problems or have suggestions, please post an issue on the project's GitHub repository or contact: [email protected]

How to build SVCollector

$ wget https://github.com/fritzsedlazeck/SVCollector/archive/master.tar.gz -O SVCollector.tar.gz
$ tar xzvf SVCollector.tar.gz
$ cd SVCollector-master/Debug
$ make

$ ./SVCollector

Running SVCollector

SVCollector runs in one of three modes (greedy, topN, or random). Each mode takes a multi-sample VCF file as input and writes a ranked list of the samples to select. The top-level command is run like this:

$ ./Debug/SVCollector
./SVCollector <option> my_svs_vcf_file output_ranked
<option>: greedy, topN or random
my_svs_vcf_file: A valid uncompressed multisample VCF file.
num_samples: The number of samples that should be ranked.
output_ranked: The file to write out the ranked list with additional information.
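
To make the input concrete, the following is a minimal Python sketch of how a multi-sample VCF can be reduced to "which sample carries which variant". It is not SVCollector's own parser: the function name `carriers_from_vcf_lines` is hypothetical, and the rule that any allele other than `0` or `.` counts as carrying the variant is an assumption made for illustration.

```python
from typing import Dict, Iterable, List, Set


def carriers_from_vcf_lines(lines: Iterable[str]) -> Dict[str, Set[int]]:
    """Map each sample name to the indices of the variants it carries.

    Illustrative only: a sample "carries" a variant here if its GT field
    contains any allele other than reference (0) or missing (.).
    """
    samples: List[str] = []
    carried: Dict[str, Set[int]] = {}
    variant_index = 0
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow FORMAT
            carried = {name: set() for name in samples}
            continue
        for name, call in zip(samples, fields[9:]):
            genotype = call.split(":")[0]  # GT is the first FORMAT key when present
            alleles = genotype.replace("|", "/").split("/")
            if any(allele not in ("0", ".") for allele in alleles):
                carried[name].add(variant_index)
        variant_index += 1
    return carried


example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t100\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL\tGT\t0/1\t0/0\t1/1",
    "1\t900\tsv2\tN\t<INS>\t.\tPASS\tSVTYPE=INS\tGT\t0/0\t./.\t0|1",
]
carried = carriers_from_vcf_lines(example_vcf)
print(carried)
```

In this toy input, S1 carries only the first variant, S2 carries none, and S3 carries both.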

Greedy Analysis

This is the recommended mode for all users as it will best optimize the selection of samples. The parameters for this mode are shown below.

$ ./Debug/SVCollector greedy
Input VCF file
Min allele count (-1 to disable)
Number of samples to select
Take AF into account (1) or not (0) per allele
Optionally: File of names to select anyways (NA to disable)
Optionally: Text File of names and weights (NA to disable)
Output file
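
The idea behind a greedy selection can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the SVCollector implementation: it ignores the minimum allele count, allele-frequency weighting, forced sample names, and per-sample weights listed above, and the function name `greedy_select` and the toy data are hypothetical. At each step it picks the sample that adds the most variants not yet covered by earlier picks.

```python
from typing import Dict, List, Set, Tuple


def greedy_select(carried: Dict[str, Set[int]], k: int) -> List[Tuple[str, int, int]]:
    """Rank up to k samples; return (sample, new variants, cumulative variants)."""
    remaining = dict(carried)
    covered: Set[int] = set()
    ranking: List[Tuple[str, int, int]] = []
    while remaining and len(ranking) < k:
        # Ties are broken alphabetically because max() keeps the first maximum.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        gain = len(remaining[best] - covered)
        covered |= remaining.pop(best)
        ranking.append((best, gain, len(covered)))
    return ranking


# Toy data: sample name -> set of variant IDs carried by that sample.
carried = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {1, 2, 3},
    "D": {6},
}
ranking = greedy_select(carried, 3)
print(ranking)  # [('A', 4, 4), ('B', 1, 5), ('D', 1, 6)]
```

Note that sample C is never chosen even though it carries three variants, because all of them are already covered by A.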

topN Analysis

This mode is provided for comparison, to evaluate how the greedy mode performs against this simpler selection strategy. The parameters for this mode are shown below.

$ ./Debug/SVCollector topN
Input VCF file
Number of samples to select
Take AF into account (1) or not (0) per allele
Output file
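
For contrast, a topN-style selection can be sketched as ranking samples purely by how many variants each one carries on its own. The Python below is a hedged illustration, not SVCollector's code; `top_n_select` and the toy data are hypothetical, and allele-frequency weighting is not modeled.

```python
from typing import Dict, List, Set


def top_n_select(carried: Dict[str, Set[int]], n: int) -> List[str]:
    """Return the n samples with the most variants, ignoring overlap between them."""
    ranked = sorted(carried, key=lambda name: (-len(carried[name]), name))
    return ranked[:n]


def distinct_variants(carried: Dict[str, Set[int]], chosen: List[str]) -> int:
    """Count the distinct variants covered by the chosen samples."""
    covered: Set[int] = set()
    for name in chosen:
        covered |= carried[name]
    return len(covered)


# Toy data: sample name -> set of variant IDs carried by that sample.
carried = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {1, 2, 3},
    "D": {6},
}
chosen = top_n_select(carried, 3)
print(chosen, distinct_variants(carried, chosen))  # ['A', 'B', 'C'] 5
```

Because the three largest samples overlap heavily, they cover only 5 of the 6 variants in this toy set; a selection that accounts for overlap can reach all 6 with the same number of samples.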

random Analysis

This is the most naive approach that just picks N samples at random from the entire input VCF file. The parameters for this mode are shown below.

$ ./Debug/SVCollector random
Input VCF file
Number of samples to select
Take AF into account (1) or not (0) per allele
Output file
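
A random baseline is usually averaged over several trials, since any single draw can be lucky or unlucky. The following Python sketch shows that idea on toy data; it does not reproduce SVCollector's random mode, and the names `random_select` and `distinct_variants`, the seeds, and the data are all hypothetical.

```python
import random
from typing import Dict, List, Set


def random_select(carried: Dict[str, Set[int]], n: int, seed: int) -> List[str]:
    """Pick n sample names uniformly at random, reproducibly for a given seed."""
    rng = random.Random(seed)
    names = sorted(carried)
    return rng.sample(names, min(n, len(names)))


def distinct_variants(carried: Dict[str, Set[int]], chosen: List[str]) -> int:
    """Count the distinct variants covered by the chosen samples."""
    covered: Set[int] = set()
    for name in chosen:
        covered |= carried[name]
    return len(covered)


# Toy data: sample name -> set of variant IDs carried by that sample.
carried = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {1, 2, 3},
    "D": {6},
}
# Ten trials of picking 2 samples at random, one seed per trial.
trial_counts = [
    distinct_variants(carried, random_select(carried, 2, seed)) for seed in range(10)
]
mean_count = sum(trial_counts) / len(trial_counts)
print(trial_counts, mean_count)
```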

Running all modes

We also provide a helper script (SVCollector.sh) that will run all 3 modes (greedy, topN, and 10 trials of a random selection) and make a simple plot comparing the results over the first numtoplot samples from the input VCF file. If you have GNU Parallel installed you should edit the script to replace the for loop with the much faster parallel version.

$ ./SVCollector.sh
USAGE: SVCollector.sh samples.vcf numtoplot workdir

Demo

For evaluation, we include a simulation script that generates a multi-sample VCF file with an arbitrary population structure. Briefly, the simulator generates F founder genotypes, each containing Normal(N,M) variants placed at random along the genome (the genome size is fixed at 100,000,000 bp). Then, for each founder population, Normal(S,T) individual samples are generated, each containing the original founder variants plus an additional Normal(X,Y) variants. Consequently, the expected total number of variants in the collection is F * N + F * S * X. If N > X, most of the variants in a sample are shared within its population group, and if X > N, most are unique to that sample. We emphasize that this is not designed to simulate realistic pedigrees, but to examine the extremes of high or low levels of sharing among the individuals.
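
As a worked example of the expected-count formula, the short Python sketch below plugs in the mean parameters of the two demo populations described later in this section (10 founders with 1000 founder variants and 100 extra variants per sample; 10 founders with means of 500 founder variants and 500 extra variants per sample). The function names are hypothetical, and the "founder share" N / (N + X) is an added summary of how much of a typical sample is shared, not a quantity reported by the simulator.

```python
def expected_total_variants(founders: int, founder_variants: float,
                            samples_per_founder: float, sample_variants: float) -> float:
    """Expected distinct variants in the collection: F * N + F * S * X."""
    return founders * founder_variants + founders * samples_per_founder * sample_variants


def founder_share(founder_variants: float, sample_variants: float) -> float:
    """Fraction of a typical sample's variants inherited from its founder: N / (N + X)."""
    return founder_variants / (founder_variants + sample_variants)


# simple10: F=10, N=1000, S=10, X=100
simple_total = expected_total_variants(10, 1000, 10, 100)
# complex10 (using the means only): F=10, N=500, S=10, X=500
complex_total = expected_total_variants(10, 500, 10, 500)

print(simple_total, round(founder_share(1000, 100), 3))  # 20000 0.909
print(complex_total, founder_share(500, 500))            # 55000 0.5
```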

After compiling the code, you can run the demo like this:

$ cd SVCollector/simul
$ ./demo.sh

Note that you will need Perl and the Math::Random CPAN module installed. With conda, the module can be installed as:

$ conda install perl-math-random

The demo script will generate 2 simulated populations (simple10.vcf and complex10.vcf) and 2 working directories with the results of a greedy selection, a topN selection, and 10 trials of a random selection from these populations.

The first simulated population (simple10.vcf) has 10 founder genomes with exactly 1000 variants located at random. From each founder genome, 10 samples are simulated that contain the 1000 founder variants plus an additional 100 variants. The sharp inflection point in the greedy curve at N=10 illustrates how the code recognizes that there are 10 founder populations. After these 10 populations have been sampled, additional SVs are identified at a much slower rate, as the remaining variants are only contained in individual samples. The plot will be written to simul/simple10/simple10.vcf.png.
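
The inflection point can be reproduced on a much smaller, fully deterministic toy population. The Python sketch below is an illustration only: it uses 3 founders instead of 10, fixed counts instead of the simulator's random placement, and a simplified greedy loop that is not SVCollector's implementation. The function names are hypothetical.

```python
from typing import Dict, List, Set


def build_population(founders: int, founder_variants: int,
                     samples_per_founder: int, private_variants: int) -> Dict[str, Set[int]]:
    """Build samples that share founder variants and each add private ones."""
    carried: Dict[str, Set[int]] = {}
    next_id = 0
    for f in range(founders):
        shared = set(range(next_id, next_id + founder_variants))
        next_id += founder_variants
        for s in range(samples_per_founder):
            private = set(range(next_id, next_id + private_variants))
            next_id += private_variants
            carried[f"F{f}_S{s}"] = shared | private
    return carried


def greedy_gains(carried: Dict[str, Set[int]], k: int) -> List[int]:
    """Return the number of new variants contributed by each greedy pick."""
    remaining = dict(carried)
    covered: Set[int] = set()
    gains: List[int] = []
    for _ in range(min(k, len(remaining))):
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        gains.append(len(remaining[best] - covered))
        covered |= remaining.pop(best)
    return gains


# 3 founders x 100 founder variants, 4 samples per founder, 10 private variants each.
gains = greedy_gains(build_population(3, 100, 4, 10), 6)
print(gains)  # [110, 110, 110, 10, 10, 10]
```

The first three picks each come from a different founder group and add 110 variants; from the fourth pick onward only the 10 private variants of each sample are new, which is the same kind of drop seen at N=10 in the simple10 plot.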

The second simulated population (complex10.vcf) also has 10 founder genomes, each randomly containing Normal(500,250) variants. From each founder population, Normal(10,5) samples are simulated that contain the founder variants plus an additional Normal(500,250) variants unique to each sample. Notice that despite having the same number of founder genomes, the curves are substantially different and lack the inflection point at N=10. This highlights how sample-specific variants contribute at a similar level to the founder genotypes. The plot will be written to simul/complex10/complex10.vcf.png.

Citation

Please cite our preprint: https://www.biorxiv.org/content/10.1101/2020.08.06.240390v1
