dancooke / starfish

Licence: MIT license
Intersect multiple VCF files with haplotype awareness

Programming Languages

python

Starfish

MIT license

Starfish is a tool for comparing and intersecting multiple VCF files with haplotype awareness by using the powerful RTGTools vcfeval engine. The name "Starfish" comes from the shape of the Venn diagram the program can draw (with 5 VCFs!).

Starfish Venn

Requirements

Installation

git clone --recursive https://github.com/dancooke/starfish

Usage

There are just three required options:

  • --sdf (short -t): The RTG Tools SDF reference directory (create one with the rtg format command).
  • --variants (short -V): A list of VCF files to intersect.
  • --output (short -O): A directory path to write intersections.

For example:

./starfish \
    -t reference.sdf \
    -V vcf1.vcf.gz vcf2.vcf.gz vcf3.vcf.gz \
    -O isec

This produces the directory isec containing the following files:

  • A.vcf.gz: Records unique to vcf1.vcf.gz.
  • B.vcf.gz: Records unique to vcf2.vcf.gz.
  • C.vcf.gz: Records unique to vcf3.vcf.gz.
  • AB.vcf.gz: Records in vcf1.vcf.gz and vcf2.vcf.gz but not vcf3.vcf.gz.
  • AC.vcf.gz: Records in vcf1.vcf.gz and vcf3.vcf.gz but not vcf2.vcf.gz.
  • BC.vcf.gz: Records in vcf2.vcf.gz and vcf3.vcf.gz but not vcf1.vcf.gz.
  • ABC.vcf.gz: Records common to vcf1.vcf.gz, vcf2.vcf.gz, and vcf3.vcf.gz.

In other words, the VCF files are labelled (in order) using upper-case letters, and each file in the output directory contains the records found in exactly the VCFs whose labels appear in its filename.
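The labelling scheme described above can be sketched in a few lines of Python. This is only an illustration of the naming convention, not Starfish's own code; the function name intersection_labels is hypothetical.

```python
from itertools import combinations
from string import ascii_uppercase


def intersection_labels(n_vcfs):
    """Return every non-empty combination of labels for n_vcfs inputs."""
    letters = ascii_uppercase[:n_vcfs]
    labels = []
    # Enumerate combinations by size: singles first, then pairs, and so on.
    for size in range(1, n_vcfs + 1):
        for combo in combinations(letters, size):
            labels.append("".join(combo))
    return labels


vcfs = ["vcf1.vcf.gz", "vcf2.vcf.gz", "vcf3.vcf.gz"]
for label in intersection_labels(len(vcfs)):
    # Map each letter back to the input VCF it stands for.
    members = [vcfs[ord(letter) - ord("A")] for letter in label]
    print(f"{label}.vcf.gz <- records present in exactly: {', '.join(members)}")
```

For three inputs this lists the same seven files as the example above.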

Restricting comparison to certain regions

By default, all regions in the reference genome (which must be the same for all input VCFs) are used. To restrict comparison to a subset of regions, supply a BED file to the --regions option.
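As a rough sketch, a BED file for the --regions option could be written from Python as below. The helper write_bed, the contig names, and the coordinates are all hypothetical; the contig names would need to match those in your reference. BED uses tab-separated columns with a 0-based start and an exclusive end.

```python
def write_bed(regions, path):
    """Write (chrom, start, end) tuples as a three-column BED file."""
    with open(path, "w") as bed:
        for chrom, start, end in regions:
            # BED intervals are 0-based and half-open, so end must exceed start.
            if start < 0 or end <= start:
                raise ValueError(f"invalid interval: {chrom}:{start}-{end}")
            bed.write(f"{chrom}\t{start}\t{end}\n")


# Hypothetical regions, for illustration only.
regions = [("chr1", 0, 1000000), ("chr2", 500000, 750000)]
write_bed(regions, "regions.bed")
```

The resulting regions.bed can then be passed to --regions.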

Ignoring filtered records

By default, filtered records are excluded from the comparison. To include them, add the --all-records option to your command.
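To make "filtered" concrete, here is a minimal sketch that assumes the usual VCF convention: a record counts as filtered when its FILTER column holds anything other than PASS or the missing value ".". The function is_filtered is hypothetical and is not part of Starfish or RTG Tools.

```python
def is_filtered(vcf_line):
    """Return True if a VCF data line has a FILTER value other than PASS or '.'."""
    fields = vcf_line.rstrip("\n").split("\t")
    # FILTER is the seventh fixed column: CHROM POS ID REF ALT QUAL FILTER INFO.
    filter_value = fields[6]
    return filter_value not in ("PASS", ".")


passed = "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1"
failed = "chr1\t200\t.\tC\tT\t5\tLowQual\t.\tGT\t0/1"
print(is_filtered(passed), is_filtered(failed))
```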

Ignoring genotype mismatches

By default, records will not be matched if the genotypes do not match. To ignore genotype mismatches (and only compare called alleles), use the --squash-ploidy option:

./starfish \
    -t reference.sdf \
    -V vcf1.vcf.gz vcf2.vcf.gz vcf3.vcf.gz \
    -O isec \
    --squash-ploidy
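The difference between the two modes can be illustrated with a deliberately simplified sketch. It assumes two records at the same site with identical ALT lists and compares only their GT strings; the real vcfeval engine compares haplotypes against the reference rather than GT strings, so this shows the idea only. All function names are hypothetical.

```python
import re


def split_gt(gt):
    """Split a GT string such as '0/1' or '1|0' into its allele indices."""
    return re.split(r"[/|]", gt)


def genotypes_match(gt_a, gt_b):
    """Strict mode: the same alleles in the same copy number, ignoring phase."""
    return sorted(split_gt(gt_a)) == sorted(split_gt(gt_b))


def alleles_match(gt_a, gt_b):
    """Squashed mode: only the set of called non-reference alleles must agree."""
    def called(gt):
        return {allele for allele in split_gt(gt) if allele not in ("0", ".")}
    return called(gt_a) == called(gt_b)


# A heterozygous call against a homozygous call for the same ALT allele.
print(genotypes_match("0/1", "1/1"))  # genotypes differ
print(alleles_match("0/1", "1/1"))    # called alleles agree
```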

Comparing samples without genotypes

To compare callsets without genotypes, using only the ALT alleles:

./starfish \
    -t reference.sdf \
    -V vcf1.vcf.gz vcf2.vcf.gz vcf3.vcf.gz \
    -O isec \
    --samples ALT \
    --squash-ploidy

Drawing Venn diagrams

Starfish can draw Venn diagrams showing the number of intersected records for up to 6 VCFs (if the pyvenn package is installed). To do this, supply a name for each of the VCFs with the --names option and add the --venn option:

./starfish \
    -t reference.sdf \
    -V vcf1.vcf.gz vcf2.vcf.gz vcf3.vcf.gz \
    -O isec \
    --names Octopus GATK4 FreeBayes \
    --venn
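The numbers in such a diagram are record counts per intersection. If you only need the counts, a minimal sketch like the following could tally them directly from the output directory. It assumes the files are named as described under Usage and are gzip-compatible; count_records and venn_counts are hypothetical helpers, not part of Starfish.

```python
import gzip
from pathlib import Path


def count_records(vcf_gz_path):
    """Count the non-header, non-empty lines in a compressed VCF."""
    with gzip.open(vcf_gz_path, "rt") as vcf:
        return sum(1 for line in vcf if line.strip() and not line.startswith("#"))


def venn_counts(isec_dir):
    """Map each intersection label (e.g. 'AB') to its record count."""
    suffix = ".vcf.gz"
    return {
        path.name[: -len(suffix)]: count_records(path)
        for path in sorted(Path(isec_dir).glob("*" + suffix))
    }


# Only report counts if the example output directory actually exists.
if Path("isec").is_dir():
    for label, count in venn_counts("isec").items():
        print(label, count)
```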

Limitations

Starfish has a number of limitations:

  • Only haploid and diploid genotype comparisons are supported (due to RTGTools vcfeval).
  • Only one sample can be compared. You can use the --sample option if your VCFs have multiple samples, but the given sample must be present in all input VCFs.
  • The number of unique intersections grows exponentially with the number of input VCFs: n inputs give up to 2^n - 1 output files (7 for 3 VCFs, 31 for 5).
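The single-sample limitation above can be checked before running Starfish. The sketch below reads the sample names from each VCF's #CHROM header line and reports the files that lack the chosen sample. The helpers vcf_samples and vcfs_missing_sample are hypothetical, and the sketch assumes plain-text or gzip-compatible input.

```python
import gzip


def vcf_samples(path):
    """Return the sample names listed on a VCF's #CHROM header line."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as vcf:
        for line in vcf:
            if line.startswith("#CHROM"):
                # Sample names follow the nine fixed columns (CHROM..FORMAT).
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # Reached the records without finding a header line.
    return []


def vcfs_missing_sample(sample, vcf_paths):
    """Return the paths of the VCFs that do not contain the given sample."""
    return [path for path in vcf_paths if sample not in vcf_samples(path)]
```

An empty list from vcfs_missing_sample means the sample is present in every input.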