TileDB-Inc / Tiledb Vcf

License: MIT
Efficient variant-call data storage and retrieval library using the TileDB storage library.

TileDB-VCF

A C++ library for efficient storage and retrieval of genomic variant-call data using TileDB Embedded.

Features

  • Easily ingest large amounts of variant-call data at scale
  • Supports ingesting single-sample VCF and BCF files
  • New samples are added incrementally, avoiding computationally expensive merging operations
  • Allows for highly compressed storage using TileDB sparse arrays
  • Efficient, parallelized queries of variant data stored locally or remotely on S3
  • Export lossless VCF/BCF files or extract specific slices of a dataset

What's Included?

  • Command line interface (CLI)
  • APIs for C, C++, Python, and Java
  • Integrates with Spark and Dask

Quick Start

The documentation website provides comprehensive usage examples, but here are a few quick exercises to get you started.

We'll use a dataset of 20 synthetic samples, each containing over 20 million variants. A publicly accessible copy of this dataset is hosted on S3, so if you have TileDB-VCF installed and would like to follow along, replace the URIs below with s3://tiledb-inc-demo-data/tiledbvcf-arrays/v4/vcf-samples-20. If you don't have TileDB-VCF installed yet, you can use our Docker images to try things out.

CLI

Export complete chr1 BCF files for a subset of samples:

tiledbvcf export \
  --uri vcf-samples-20 \
  --regions chr1:1-248956422 \
  --sample-names v2-usVwJUmo,v2-WpXCYApL
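
If you assemble region strings programmatically, it can help to validate them before passing them to the CLI. The following is a minimal sketch, not part of TileDB-VCF: the `Region` type and the `parse_region` and `format_regions` helpers are hypothetical names, and the code assumes 1-based inclusive coordinates (as the chr1:1-248956422 example suggests) and contig names without colons.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    """Hypothetical container for one 'contig:start-end' region."""
    contig: str
    start: int
    end: int


# Matches strings such as "chr1:1-248956422" (assumes no colon in the contig name).
_REGION_RE = re.compile(r"^([^:\s]+):(\d+)-(\d+)$")


def parse_region(text: str) -> Region:
    """Split a 'contig:start-end' string into its parts and sanity-check it."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a contig:start-end region: {text!r}")
    contig, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    # Assumption: coordinates are 1-based and inclusive.
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in region: {text!r}")
    return Region(contig, start, end)


def format_regions(regions) -> str:
    """Join regions into the comma-separated form used with --regions."""
    return ",".join(f"{r.contig}:{r.start}-{r.end}" for r in regions)


print(parse_region("chr1:1-248956422"))
```

Parsing into a structured value first means a malformed region fails early in your own script rather than partway through an export.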

Create a TSV file containing all variants within one or more regions of interest:

tiledbvcf export \
  --uri vcf-samples-20 \
  --sample-names v2-tJjMfKyL,v2-eBAdKwID \
  -Ot --tsv-fields "CHR,POS,REF,S:GT" \
  --regions "chr7:144000320-144008793,chr11:56490349-56491395"
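
When the export is driven from a script, the same command can be assembled as an argument list. This is a sketch under stated assumptions: `build_tsv_export_args` is a hypothetical helper, and it uses only the flags shown in the example above (`--uri`, `--sample-names`, `-Ot`, `--tsv-fields`, `--regions`).

```python
import shlex


def build_tsv_export_args(uri, sample_names, regions, tsv_fields):
    """Return the argument list for a `tiledbvcf export` TSV query."""
    return [
        "tiledbvcf", "export",
        "--uri", uri,
        "--sample-names", ",".join(sample_names),
        "-Ot",
        "--tsv-fields", ",".join(tsv_fields),
        "--regions", ",".join(regions),
    ]


args = build_tsv_export_args(
    uri="vcf-samples-20",
    sample_names=["v2-tJjMfKyL", "v2-eBAdKwID"],
    regions=["chr7:144000320-144008793", "chr11:56490349-56491395"],
    tsv_fields=["CHR", "POS", "REF", "S:GT"],
)

# Print a shell-quoted command line for inspection.
print(shlex.join(args))
```

Passing `args` to `subprocess.run(args, check=True)` would execute the export, provided `tiledbvcf` is on your PATH. Keeping the arguments as a list avoids shell-quoting problems with the comma-separated values.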

Python

Run the same query in Python:

import tiledbvcf

# Open the dataset in read mode
ds = tiledbvcf.Dataset(uri="vcf-samples-20", mode="r")

# Query two regions for two samples, returning only the listed attributes
ds.read(
    attrs=["sample_name", "pos_start", "fmt_GT"],
    regions=["chr7:144000320-144008793", "chr11:56490349-56491395"],
    samples=["v2-tJjMfKyL", "v2-eBAdKwID"],
)

The query returns its results as a pandas DataFrame:

     sample_name  pos_start    fmt_GT
0    v2-nGEAqwFT  143999569  [-1, -1]
1    v2-tJjMfKyL  144000262  [-1, -1]
2    v2-tJjMfKyL  144000518  [-1, -1]
3    v2-nGEAqwFT  144000339  [-1, -1]
4    v2-nzLyDgYW  144000102  [-1, -1]
..           ...        ...       ...
566  v2-nGEAqwFT   56491395    [0, 0]
567  v2-ijrKdkKh   56491373    [0, 0]
568  v2-eBAdKwID   56491391    [0, 0]
569  v2-tJjMfKyL   56491392  [-1, -1]
570  v2-nzLyDgYW   56491365  [-1, -1]
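
The `fmt_GT` column holds each genotype as a list of allele indices. A minimal sketch for rendering these as VCF-style strings follows. It assumes, as the -1 values in the output above suggest, that a negative index marks a missing allele. The helpers `format_genotype` and `is_missing` are hypothetical, and the "/" separator is an assumption because the list form shown here carries no phasing information.

```python
def format_genotype(alleles, sep="/"):
    """Render allele indices such as [0, 1] as a string such as '0/1'."""
    # Assumption: a negative index means the allele is missing ('.' in VCF).
    return sep.join("." if a < 0 else str(a) for a in alleles)


def is_missing(alleles):
    """Return True when every allele in the call is missing."""
    return all(a < 0 for a in alleles)


# Two rows copied from the example output above.
rows = [
    ("v2-tJjMfKyL", 144000262, [-1, -1]),
    ("v2-eBAdKwID", 56491391, [0, 0]),
]

for sample, pos, gt in rows:
    status = "missing" if is_missing(gt) else "called"
    print(sample, pos, format_genotype(gt), status)
```

With a DataFrame `df` returned by `ds.read`, the same helper could be applied as `df["fmt_GT"].apply(format_genotype)`.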

Code of Conduct

All participants in TileDB spaces are expected to adhere to high standards of professionalism in all interactions. This repository is governed by the standards and reporting procedures detailed in the TileDB core repository Code of Conduct.