carjed / Helmsman

License: MIT
Highly efficient, lightweight mutation signature matrix aggregation


Announcements

2020-07-22 CRITICAL BUGFIX

In versions 1.4.4 and earlier, when using Helmsman to process tab-delimited text input (i.e., with the --mode txt option), a bug in the code caused sample labels to be incorrectly sorted with respect to the final mutation counts, which resulted in mutation counts and spectra being scrambled across sample labels.

This issue did not affect any of the other input modes (e.g., --mode vcf, --mode agg, or --mode maf), and does not appear to have impacted the results of any papers or preprints that have used Helmsman. We urge any users that are processing data with --mode txt to ensure that they are using version 1.5.0+ of Helmsman.
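To make the failure mode concrete, the sketch below shows one way sample labels can become detached from their counts. It is a hypothetical illustration of this class of bug, not Helmsman's actual code; the function names and toy data are invented for the example.

```python
# Hypothetical illustration of the bug class; not Helmsman's actual code.

def attach_labels_buggy(labels, count_rows):
    # Sorting the labels on their own breaks the row-by-row pairing with the counts.
    return dict(zip(sorted(labels), count_rows))


def attach_labels_safe(labels, count_rows):
    # Pair each label with its row first, then sort the pairs together.
    return dict(sorted(zip(labels, count_rows)))


labels = ["sampleB", "sampleA"]
count_rows = [[5, 0, 1], [0, 7, 2]]  # row i belongs to labels[i]

print(attach_labels_buggy(labels, count_rows))  # sampleB is given sampleA's counts
print(attach_labels_safe(labels, count_rows))   # each label keeps its own counts
```

If you have older --mode txt output, one sanity check is to rerun a small subset with version 1.5.0+ and compare the per-sample counts.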

Thanks to @jmcbroome for finding and fixing this bug.

O be swift—
we have always known you wanted us.

(from 'The Helmsman' by Hilda Doolittle [1886 - 1961])


Introduction

Helmsman is a utility for rapidly and efficiently generating mutation spectra matrices from massive next-generation sequencing datasets, for use in a wide range of mutation signature analysis tools. See Alexandrov et al., Cell Reports, 2013 for a detailed explanation of how mutation signature analysis methods work, and why they are useful in studying cancer genomes.

Currently, the majority of mutation signature analysis methods are implemented as R packages. To generate the mutation spectra matrix, these packages must read the entire dataset (containing information about every individual mutation) into memory. This is generally not a problem for datasets containing only a few thousand SNVs or a few dozen samples, but with much larger datasets it is easy to exceed the physical memory capacity of your machine. Depending on the dimensions of the data, even very small files can take a long time to process in these R packages, sometimes over 30 minutes for a 3 MB file.

Other mutation signature analysis tools do not even provide these convenience functions, and leave it to the user to coerce their data into program-specific formats. Not only is this inconvenient, but it impedes users' ability to port their data between tools and take advantage of the unique features provided by different packages.

Helmsman aims to alleviate these performance barriers and lack of standardization. Helmsman was initially written to evaluate patterns of variation in massive whole-genome datasets, containing tens or hundreds of millions of SNVs observed in tens of thousands of individuals. Generating the mutation spectra matrix for such data is virtually impossible with R-based implementations, but Helmsman provides a very fast and scalable solution for analyzing these datasets.
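The memory argument can be sketched in a few lines of Python. This is a minimal illustration of streaming aggregation in general, not Helmsman's implementation: aggregate_spectra and example_records are hypothetical names, and the generator stands in for a real file parser. Because only the running counts are kept, memory use grows with the number of samples and subtypes rather than with the number of mutations.

```python
from collections import defaultdict


def aggregate_spectra(records):
    """Stream (sample, subtype) pairs into per-sample subtype counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for sample, subtype in records:
        # Each record is counted and then discarded; nothing else is stored.
        counts[sample][subtype] += 1
    return counts


def example_records():
    # Stand-in for a parser that yields one pair per observed mutation.
    yield ("sample1", "A[C>T]G")
    yield ("sample2", "T[T>C]A")
    yield ("sample1", "A[C>T]G")


spectra = aggregate_spectra(example_records())
print(dict(spectra["sample1"]))
```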

If you plan to use the output of Helmsman in other mutation signature analysis packages, Helmsman can automatically generate a small R script with all the code necessary to read the mutation spectra matrix and format it for compatibility with existing tools, using functions from the musigtools package.

Helmsman includes several other convenient features, including:

  • pooling samples together when generating the mutation spectra matrix
  • aggregating data from multiple VCF files
  • bare-bones (and really fast!) non-negative matrix factorization, to extract signatures without relying on external packages
  • running in parallel
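As an illustration of what pooling means for the matrix, the sketch below sums the rows of samples that share a group label. It assumes a plain dictionary representation and a hypothetical pool_samples helper; it does not reflect Helmsman's own options or internals.

```python
def pool_samples(matrix, groups):
    """Sum the count rows of samples that map to the same group.

    matrix: dict of sample ID -> list of subtype counts
    groups: dict of sample ID -> group ID (hypothetical grouping scheme)
    """
    pooled = {}
    for sample, row in matrix.items():
        group = groups.get(sample, sample)  # unmapped samples stay on their own
        if group not in pooled:
            pooled[group] = [0] * len(row)
        pooled[group] = [a + b for a, b in zip(pooled[group], row)]
    return pooled


matrix = {"s1": [1, 0, 2], "s2": [3, 1, 0], "s3": [0, 0, 4]}
groups = {"s1": "cohortA", "s2": "cohortA"}

print(pool_samples(matrix, groups))
```

Pooling reduces the number of rows in the matrix, which can help when individual samples carry too few mutations to give a stable spectrum.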

Setup

Using Conda (recommended)

The easiest way to start using Helmsman is to create a Conda environment, based on the dependencies specified in the env.yml file:

git clone https://github.com/carjed/helmsman.git
cd helmsman

conda env create -n helmsman -f env.yml
source activate helmsman

Using pip

If you do not have Conda on your system, the prerequisites for Helmsman can also be installed with pip inside of a python3 virtual environment (assuming you have virtualenv installed):

git clone https://github.com/carjed/helmsman.git
cd helmsman

virtualenv -p python3 helmsman_env
source helmsman_env/bin/activate

pip install -r pip_reqs.txt

It is also possible to forgo the virtual environment setup and use pip to install the necessary dependencies in your global site-packages directory, but this is not recommended as doing so may cause dependency conflicts between Helmsman and other programs/packages.

Docker

For more flexible deployment options, Helmsman is available as a Docker container. The following command will pull and run the preconfigured image from the Docker Hub:

# -v maps the local directory containing input data to /data in the container
# -p exposes the Jupyter notebook on port 8888
# --NotebookApp.token='' starts the notebook with token authentication disabled
docker run -d --name helmsman \
  -v /path/to/local/data:/data \
  -p 8888:8888 \
  carjed/helmsman \
  start-notebook.sh --NotebookApp.token=''

You may also clone this repository and build the image locally from the Dockerfile, using the following commands:

git clone https://github.com/carjed/helmsman.git
cd helmsman

docker build -t helmsman --force-rm .

docker run -d --name helmsman \
  -p 8888:8888 \
  helmsman \
  start-notebook.sh --NotebookApp.token=''

Quick Start

Suppose we have a Variant Call Format (VCF) file named input.vcf, containing the genotypes of N individuals at each somatic mutation identified. You will also need the corresponding reference genome. With the following command, Helmsman will parse the VCF file and write a file under /path/to/output/ containing the Nx96 mutation spectra matrix:

python helmsman.py --input /path/to/input.vcf --fastafile /path/to/reference_genome.fasta --projectdir /path/to/output/
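The 96 columns come from 6 substitution types (with the reference base collapsed to a pyrimidine, C or T) multiplied by 4 possible 5' bases and 4 possible 3' bases. The sketch below shows this widely used convention; the subtype function and the label format are assumptions for illustration, and Helmsman's own column labels may differ.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def subtype(context, alt):
    """Label an SNV given its 3-base reference context and alternate base."""
    ref = context[1]
    if ref in "AG":
        # Purine reference: describe the same mutation on the opposite strand.
        context = reverse_complement(context)
        alt = alt.translate(COMPLEMENT)
        ref = context[1]
    return f"{context[0]}[{ref}>{alt}]{context[2]}"


SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
ALL_SUBTYPES = [
    f"{five}[{sub}]{three}"
    for sub in SUBSTITUTIONS
    for five, three in product("ACGT", repeat=2)
]

print(len(ALL_SUBTYPES))     # 6 substitutions x 16 flanking pairs
print(subtype("ACG", "T"))   # pyrimidine reference, reported as-is
print(subtype("CGT", "A"))   # purine reference, collapsed to the same label
```

Each row of the output matrix then holds one sample's counts across these 96 subtypes.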

Citation

If you use Helmsman in your research, please cite our paper published in BMC Genomics:

Carlson J, Li JZ, Zöllner S. Helmsman: fast and efficient mutation signature analysis for massive sequencing datasets. BMC Genomics. 2018;19:845. doi:10.1186/s12864-018-5264-y


The Helmsman mascot was designed by Robert James Russell—view more of his work at http://www.robertjamesrussell.com/art/ and follow him on Twitter at @robhollywood!
