spark-vcf

A VCF data source for Apache Spark, implemented natively in Spark.

Introduction

Spark VCF lets you load VCF files natively into an Apache Spark DataFrame/Dataset. To get started, clone or download this repository, run mvn package, and use the resulting jar. The library is also available on Maven Central.

Because spark-vcf is written specifically for Spark, it carries less overhead and can perform better in many areas. Benchmarks against other libraries have not been published yet (see TODO).

Installation

spark-vcf can be packaged from source or added as a dependency to a Maven, sbt, or Gradle project.

To install spark-vcf with Maven, add the following to your pom.xml:

<dependency>
  <groupId>com.lifeomic</groupId>
  <artifactId>spark-vcf</artifactId>
  <version>0.3.0</version>
</dependency>

For sbt:

libraryDependencies += "com.lifeomic" % "spark-vcf" % "0.3.0"

If you are using Gradle, the dependency is (on Gradle versions older than 3.4, use compile instead of implementation):

implementation group: 'com.lifeomic', name: 'spark-vcf', version: '0.3.0'

Getting Started

Getting started with Spark VCF is as simple as:

// `spark` is an existing SparkSession (for example, the one provided by spark-shell).
val myVcf = spark.read
    .format("com.lifeomic.variants")
    .load("src/test/resources/example.vcf")

The schema contains the standard VCF columns, with options to expand the INFO and/or FORMAT columns. An example schema for a 1000 Genomes file is shown below:

 |-- chrom: string (nullable = true)
 |-- pos: long (nullable = true)
 |-- start: long (nullable = true)
 |-- stop: long (nullable = true)
 |-- id: string (nullable = true)
 |-- ref: string (nullable = true)
 |-- alt: string (nullable = true)
 |-- qual: string (nullable = true)
 |-- filter: string (nullable = true)
 |-- info: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- gt: string (nullable = true)
 |-- sampleid: string (nullable = true)
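
Given that schema, a loaded VCF can be queried like any other DataFrame. The following is a minimal sketch, not part of the library: the object name VcfQuerySketch, the helper selectRegion, the chromosome and position values, and the INFO key "AF" are all assumptions chosen for illustration. Only the column names and the data source name come from the text above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object VcfQuerySketch {
  // Keep rows for one chromosome whose pos lies in [from, to], and pull a
  // single key out of the info map. infoKey is a hypothetical example key;
  // the resulting column is named after the lower-cased key.
  def selectRegion(vcf: DataFrame, chrom: String, from: Long, to: Long,
                   infoKey: String = "AF"): DataFrame =
    vcf
      .filter(col("chrom") === chrom && col("pos").between(from, to))
      .select(
        col("chrom"), col("pos"), col("ref"), col("alt"),
        col("info").getItem(infoKey).as(infoKey.toLowerCase),
        col("sampleid"), col("gt"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-vcf-query-sketch")
      .master("local[*]")
      .getOrCreate()

    val myVcf = spark.read
      .format("com.lifeomic.variants")
      .load("src/test/resources/example.vcf")

    // Illustrative region; adjust to the contigs present in your file.
    selectRegion(myVcf, chrom = "20", from = 1000000L, to = 2000000L)
      .show(10, truncate = false)

    spark.stop()
  }
}
```

Because info is a map of strings, any numeric INFO value extracted this way still needs an explicit cast before it can be used in arithmetic.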

Several options control how the FORMAT and INFO columns are returned. To return the FORMAT fields as a single map instead of separate columns, set the use.format.map option to true. This can also speed up the Spark job, because the reader no longer has to consult the header for type and column information.

val mappedFormat = spark.read
    .format("com.lifeomic.variants")
    .option("use.format.map", "true")
    .load("src/test/resources/example.vcf")

You can also leave the FORMAT fields typed as strings by setting use.format.type to false.
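
A minimal sketch of that setting, assuming use.format.type is passed as a reader option in the same way as use.format.map. The object name StringFormatSketch is hypothetical; the path is the example file used earlier.

```scala
import org.apache.spark.sql.SparkSession

object StringFormatSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-vcf-string-formats")
      .master("local[*]")
      .getOrCreate()

    // Assumption: the option is set on the reader, like use.format.map.
    val stringFormats = spark.read
      .format("com.lifeomic.variants")
      .option("use.format.type", "false")
      .load("src/test/resources/example.vcf")

    // Inspect the resulting column types.
    stringFormats.printSchema()

    spark.stop()
  }
}
```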

One more note: although the core of spark-vcf is written as a Spark data source, it is still advisable to use the BGZFEnhancedGzipCodec from Hadoop-BAM when reading bgzip-compressed files, so that Spark can split and partition them properly. For example:

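import org.apache.spark.SparkConf

// Register Hadoop-BAM's BGZF-aware codec so bgzip files can be split.
val sparkConf = new SparkConf()
    .setAppName("testing")
    .setMaster("local[8]")
    .set("spark.hadoop.io.compression.codecs", "org.seqdoop.hadoop_bam.util.BGZFEnhancedGzipCodec")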
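
To show where that configuration fits, here is a minimal sketch that builds a SparkSession from it and loads a bgzip-compressed VCF. It assumes Hadoop-BAM is on the classpath. The object name BgzfVcfSketch, the helper bgzfConf, and the path path/to/example.vcf.gz are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object BgzfVcfSketch {
  val BgzfCodec = "org.seqdoop.hadoop_bam.util.BGZFEnhancedGzipCodec"

  // Build a SparkConf that registers the Hadoop-BAM codec for compressed input.
  def bgzfConf(appName: String, master: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      .setMaster(master)
      .set("spark.hadoop.io.compression.codecs", BgzfCodec)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .config(bgzfConf("testing", "local[8]"))
      .getOrCreate()

    // Hypothetical path to a bgzip-compressed VCF.
    val vcf = spark.read
      .format("com.lifeomic.variants")
      .load("path/to/example.vcf.gz")

    // Print how many partitions Spark created for the input.
    println(s"partitions: ${vcf.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

Printing the partition count is a quick way to check whether a large bgzip file was split rather than read as a single partition.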

TODO

  • Provide performance benchmarks compared to other libraries
  • Get Travis CI set up

License

The MIT License

Copyright 2017 Lifeomic

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
