google / Nucleus

Licence: other
Python and C++ code for reading and writing genomics data.

Nucleus

Nucleus is a library of Python and C++ code designed to make it easy to read, write and analyze data in common genomics file formats like SAM and VCF. In addition, Nucleus enables painless integration with the TensorFlow machine learning framework, as anywhere a genomics file is consumed or produced, a TensorFlow tfrecords file may be used instead.
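As a minimal sketch of what reading a genomics file can look like, the snippet below counts variants per chromosome. It assumes the `nucleus.io.vcf.VcfReader` class and the `reference_name` field on each variant record; `count_by_contig` and `count_variants_in_file` are hypothetical helper names chosen for this illustration. Nucleus is imported inside the function so the counting logic can be tried without Nucleus installed.

```python
import collections


def count_by_contig(variants):
    """Counts records per chromosome for any iterable of objects
    that expose a `reference_name` attribute."""
    return collections.Counter(v.reference_name for v in variants)


def count_variants_in_file(path):
    """Counts variants per chromosome in a VCF file.

    Requires Nucleus; the import is done lazily because Nucleus is
    only available on Linux.
    """
    from nucleus.io import vcf  # assumed module path
    with vcf.VcfReader(path) as reader:
        return count_by_contig(reader)
```

Going by the description above, passing the path of a TFRecord file in place of the VCF path should work the same way, though this sketch does not check that.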

Tutorial

Please check out our tutorial on using Nucleus and TensorFlow for DNA sequencing error correction. It's a Python notebook that demonstrates how Nucleus integrates information from multiple file types (BAM, VCF and FASTA) and turns it into a form usable by TensorFlow.

Installation

Nucleus currently only works on modern Linux systems using Python 3. It must be installed using a version of pip less than 21. To determine the version of pip installed on your system, run

pip --version

To install Nucleus, run

pip install --user google-nucleus

Note that Nucleus doesn't yet work with Python 3.8. Also, you can ignore any "Failed building wheel for google-nucleus" error messages -- these are expected and won't prevent Nucleus from installing successfully.

If you are using Python 2, instead run

pip install --user google-nucleus==0.3.2
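The rules in this section can be summarised as a small decision function. This is only a sketch: `pip_install_command` is a hypothetical helper, not part of Nucleus, and it assumes that Python 3.8 and every later version are unsupported, which goes slightly beyond what the note above states. It does not check the operating system.

```python
def pip_install_command(python_version, pip_version):
    """Returns the pip command for this setup, or None if unsupported.

    python_version: (major, minor) tuple, e.g. sys.version_info[:2].
    pip_version: version string as printed by `pip --version`, e.g. "20.2.4".
    """
    pip_major = int(pip_version.split(".")[0])
    if pip_major >= 21:
        # Nucleus must be installed with a pip version below 21.
        return None
    if python_version[0] == 2:
        # Python 2 users need the last release that supported it.
        return "pip install --user google-nucleus==0.3.2"
    if python_version >= (3, 8):
        # Assumption: 3.8 and anything newer is treated as unsupported.
        return None
    return "pip install --user google-nucleus"
```

For the running interpreter, the inputs could come from `sys.version_info[:2]` and `pip.__version__`.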

Documentation

Building from source

For Ubuntu 14, Ubuntu 16, Ubuntu 18 and Debian 9 systems, building from source is easy. Simply type

source install.sh

For all other systems, you will need to first install CLIF by following the instructions at https://github.com/google/clif#installation before running install.sh. You'll need to run this command with Python 3.6 or 3.7.

Note that install.sh relies heavily on apt-get, so it is unlikely to run on non-Debian-based systems without extensive modifications.

Nucleus depends on TensorFlow. By default, install.sh will install a CPU-only version of a stable TensorFlow release (currently 2.4). If that isn't what you want, there are several other options that can be enabled with a simple edit to install.sh.

Running install.sh will build all of Nucleus's programs and libraries. You can find the generated binaries under bazel-bin/nucleus. If in addition to building Nucleus you would like to run its tests, execute

bazel test -c opt $BAZEL_FLAGS nucleus/...

Version

This is Nucleus 0.5.8. Nucleus follows semantic versioning.
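Because versions follow the MAJOR.MINOR.PATCH scheme, a script can check whether an installed release is new enough for a feature listed below (for example, headerless VCF writing arrived in 0.5.0). The helpers here are a hypothetical sketch, not part of Nucleus, and they handle only plain three-part versions without pre-release tags.

```python
def parse_version(version):
    """Splits "MAJOR.MINOR.PATCH" into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def is_at_least(installed, required):
    """True if `installed` is the same as or newer than `required`."""
    # Tuples compare element by element, so 0.10.0 sorts after 0.9.9,
    # which a plain string comparison would get wrong.
    return parse_version(installed) >= parse_version(required)
```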

New in 0.5.8:

  • Update util/vis.py to use updated channel names.
  • Support MED_DP (median DP) field for a VariantCall.

New in 0.5.7:

  • Add automatic pileup curation functionality in util/vis.py.
  • Upgrade protobuf settings to support TensorFlow 2.4.0 specifically.

New in 0.5.6:

  • Upgrade to protobuf 3.9.2 to support TensorFlow 2.3.0 specifically.

New in 0.5.5:

  • Upgrade protobuf settings to support TensorFlow 2.2.0 specifically.

New in 0.5.4:

  • Upgrade to protobuf 3.8.0 to support TensorFlow 2.1.0.
  • Add explicit .close() method to TFRecordWriter.

New in 0.5.3:

  • Fixes memory leaks in message_module.cc.
  • Updates setup.py to install .egg-info directory for pip 20.2+ compatibility.
  • Pins TensorFlow to 2.0.0 for protobuf version compatibility.
  • Pins setuptools to 49.6.0 to avoid breaking changes of setuptools 50.

New in 0.5.2:

  • Upgrades htslib dependency from 1.9 to 1.10.2.
  • More informative error message for failed SAM header parsing.
  • util/vis.py now supports saving images to Google Cloud Storage.

New in 0.5.1:

  • Added new utilities for working with DeepVariant pileup images and variant protos.

New in 0.5.0:

  • Fixed a bug that prevented Nucleus from working with TensorFlow 2.0.
  • Added util.vis routines for visualizing DeepVariant pileup examples.
  • FASTA reader now supports keep_true_case option for keeping the original casing.
  • VCF writer now supports writing headerless VCF files.
  • SAM reader now supports optional fields of type 'B'.
  • variant_utils now supports gVCF files.
  • Numerous minor bug fixes.

New in 0.4.1:

  • Pip package is slightly more robust.

New in 0.4.0:

  • The Nucleus pip package now works with Python 3.

New in 0.3.0:

  • Reading of VCF, SAM, and most other genomics files is now twice as fast.
  • Read range and end calculations are now done in C++ for speed.
  • VcfReader can now read "headerless" VCF files.
  • variant_utils.major_allele_frequency now 5x faster.
  • Memory leaks fixed in TFRecordReader/Writer and gfile_cc.

New in 0.2.3:

  • Nucleus no longer depends on any specific version of TensorFlow's Python code. This should make it easier to use Nucleus with, for example, TensorFlow 2.0.
  • Added BCF support to VcfWriter.
  • Fixed memory leaks in VcfWriter::Write.
  • Added print_tfrecord example program.

New in 0.2.2:

  • Faster SAM file querying and read overlap calculations.
  • Writing protocol buffers to files uses less memory.
  • Smaller pip package.
  • nucleus/util:io_utils refactored into nucleus/io:tfrecord and nucleus/io:sharded_file_utils.
  • Alleles coming from VCF files are now always normalized as uppercase.

New in 0.2.1:

  • Upgrades htslib dependency from 1.6 to 1.9.
  • Minor VCF parsing fixes.
  • Added new example program, apply_genotyping_prior.
  • Slightly more robust pip package.

New in 0.2.0:

  • Support for reading and writing BedGraph files.
  • Support for reading and writing GFF files.
  • Support for reading and writing CRAM files.
  • Support for writing SAM/BAM files.
  • Support for reading unindexed FASTA files.
  • Iteration support for indexed FASTA files.
  • Ability to read VCF files from memory.
  • Python API documentation.
  • Python 3 compatibility.
  • Added universal file converter example program.

License

Nucleus is licensed under the terms of the Apache 2 license.

Support

The Genomics team in Google Brain actively supports Nucleus and is always interested in improving its quality. If you run into an issue, please report the problem on our Issue tracker. Be sure to add enough detail to your report that we can reproduce the problem and fix it. We encourage including links to snippets of BAM/VCF/etc. files that provoke the bug, if possible. Depending on the severity of the issue, we may patch Nucleus immediately with the fix or roll it into the next release.

Contributing

Interested in contributing? See CONTRIBUTING.

History

Nucleus grew out of the DeepVariant project.

Disclaimer

This is not an official Google product.
