TileDB-VCF

A C++ library for efficient storage and retrieval of genomic variant-call data using TileDB Embedded.

Features

Easily ingest large amounts of variant-call data at scale
Supports ingesting single-sample VCF and BCF files
New samples are added incrementally, avoiding computationally expensive merging operations
Allows for highly compressed storage using TileDB sparse arrays
Efficient, parallelized queries of variant data stored locally or remotely on S3
Export lossless VCF/BCF files or extract specific slices of a dataset

What's Included?

Command line interface (CLI)
APIs for C, C++, Python, and Java
Integrates with Spark and Dask

Quick Start

The documentation website provides comprehensive usage examples, but here are a few quick exercises to get you started.

We'll use a dataset of 20 synthetic samples, each containing over 20 million variants. We host a publicly accessible version of this dataset on S3, so if you have TileDB-VCF installed and you'd like to follow along, just swap out the URIs below for s3://tiledb-inc-demo-data/tiledbvcf-arrays/v4/vcf-samples-20. If you don't have TileDB-VCF installed yet, you can use our Docker images to test things out.

CLI

Export complete chr1 BCF files for a subset of samples:

tiledbvcf export \
  --uri vcf-samples-20 \
  --regions chr1:1-248956422 \
  --sample-names v2-usVwJUmo,v2-WpXCYApL

Create a TSV file containing all variants within one or more regions of interest:

tiledbvcf export \
  --uri vcf-samples-20 \
  --sample-names v2-tJjMfKyL,v2-eBAdKwID \
  -Ot --tsv-fields "CHR,POS,REF,S:GT" \
  --regions "chr7:144000320-144008793,chr11:56490349-56491395"
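
The region strings above follow a contig:start-end pattern, with several regions joined by commas. When regions come from user input, it can help to validate them before running a query. The following is a minimal sketch: parse_region is a hypothetical helper, not a TileDB-VCF function, and the span calculation assumes both coordinates are inclusive.

```python
import re

# Pattern for the "contig:start-end" form used in the examples above.
_REGION_RE = re.compile(r"^(?P<contig>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region):
    """Split a "contig:start-end" string into (contig, start, end)."""
    match = _REGION_RE.match(region)
    if match is None:
        raise ValueError(f"not a contig:start-end region: {region!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start is greater than end in {region!r}")
    return match["contig"], start, end


regions = "chr7:144000320-144008793,chr11:56490349-56491395".split(",")
parsed = [parse_region(r) for r in regions]
for contig, start, end in parsed:
    # Assumption: both ends are inclusive, so the span is end - start + 1.
    print(contig, start, end, end - start + 1)
```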

Python

Run the same query in Python:

import tiledbvcf

# Open the dataset in read mode
ds = tiledbvcf.Dataset(uri="vcf-samples-20", mode="r")

# Read the genotypes of two samples across two regions
ds.read(
    attrs=["sample_name", "pos_start", "fmt_GT"],
    regions=["chr7:144000320-144008793", "chr11:56490349-56491395"],
    samples=["v2-tJjMfKyL", "v2-eBAdKwID"],
)

The query returns its results as a pandas DataFrame:

     sample_name  pos_start    fmt_GT
0    v2-nGEAqwFT  143999569  [-1, -1]
1    v2-tJjMfKyL  144000262  [-1, -1]
2    v2-tJjMfKyL  144000518  [-1, -1]
3    v2-nGEAqwFT  144000339  [-1, -1]
4    v2-nzLyDgYW  144000102  [-1, -1]
..           ...        ...       ...
566  v2-nGEAqwFT   56491395    [0, 0]
567  v2-ijrKdkKh   56491373    [0, 0]
568  v2-eBAdKwID   56491391    [0, 0]
569  v2-tJjMfKyL   56491392  [-1, -1]
570  v2-nzLyDgYW   56491365  [-1, -1]
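
The fmt_GT column above holds one list of allele indices per record. A small helper can turn these into VCF-style genotype strings for display. This is a sketch under two assumptions that the output above does not state: that -1 marks a missing allele, and that calls are unphased, so alleles are joined with "/". gt_to_string is a hypothetical name, not part of the tiledbvcf API.

```python
def gt_to_string(alleles, missing=-1, sep="/"):
    """Render a list of allele indices as a VCF-style genotype string.

    Assumptions: `missing` (-1) marks a missing allele and calls are
    unphased, so `sep` defaults to "/".
    """
    return sep.join("." if a == missing else str(a) for a in alleles)


# Values copied from the fmt_GT column shown above.
for gt in ([-1, -1], [0, 0]):
    print(gt, "->", gt_to_string(gt))
```

With the DataFrame returned by ds.read, the helper could be applied per row, for example with df["fmt_GT"].map(gt_to_string).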

Want to Learn More?

Code of Conduct

All participants in TileDB spaces are expected to adhere to high standards of professionalism in all interactions. This repository is governed by the standards and reporting procedures detailed in the TileDB core repository's Code of Conduct.
