carjed / Helmsman

Licence: mit

highly-efficient & lightweight mutation signature matrix aggregation

Programming Languages

python

Labels

bioinformatics sequencing vcf

Announcements

2020-07-22 CRITICAL BUGFIX

In versions 1.4.4 and earlier, when using Helmsman to process tab-delimited text input (i.e., with the --mode txt option), a bug in the code caused sample labels to be incorrectly sorted with respect to the final mutation counts, which resulted in mutation counts and spectra being scrambled across sample labels.

This issue did not affect any of the other input modes (e.g., --mode vcf, --mode agg, or --mode maf), and does not appear to have impacted the results of any papers or preprints that have used Helmsman. We urge any users that are processing data with --mode txt to ensure that they are using version 1.5.0+ of Helmsman.

Thanks to @jmcbroome for finding and fixing this bug.

O be swift—
we have always known you wanted us.

(from 'The Helmsman' by Hilda Doolittle [1886 - 1961])

Introduction

Helmsman is a utility for rapidly and efficiently generating mutation spectra matrices from massive next-generation sequencing datasets, for use in a wide range of mutation signature analysis tools. See Alexandrov et al., Cell Reports, 2013 for a detailed explanation of how mutation signature analysis methods work, and why they are useful in studying cancer genomes.

Currently, the majority of mutation signature analysis methods are implemented as R packages. To generate the mutation spectra matrix, these packages must read the entire dataset (containing information about every individual mutation) into memory. This is generally not a problem for datasets containing only a few thousand SNVs or a few dozen samples, but for much larger datasets, it is extremely easy to exceed the physical memory capacity of your machine. Depending on the dimensions of the data, even very small files can take a very long time to process in these R packages, sometimes over 30 minutes for a 3Mb file!
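To make the memory argument concrete, here is a minimal sketch of the alternative to loading everything at once: tallying mutations one line at a time, so that memory grows with the number of samples and mutation categories rather than with the number of mutations. The two-column tab-delimited layout and the function name `stream_counts` are assumptions made for illustration; this is not Helmsman's input format or its actual implementation.

```python
import io
from collections import Counter, defaultdict


def stream_counts(lines):
    """Tally (sample, category) pairs one line at a time.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Only the running counts are kept in memory, never the full dataset.
    """
    spectra = defaultdict(Counter)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        # Hypothetical layout: sample ID, then mutation category.
        sample, category = line.split("\t")[:2]
        spectra[sample][category] += 1
    return spectra


# Small in-memory stand-in for a large file on disk.
demo = io.StringIO(
    "#sample\tcategory\n"
    "s1\tA[C>T]G\n"
    "s2\tT[T>C]A\n"
    "s1\tA[C>T]G\n"
)
spectra = stream_counts(demo)
```

Because the function only iterates over its input, the same code works on a file handle for a file far larger than available memory.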

Other mutation signature analysis tools do not even provide these convenience functions, and leave it to the user to coerce their data into program-specific formats. Not only is this inconvenient, but it impedes users' ability to port their data between tools and take advantage of the unique features provided by different packages.

Helmsman aims to alleviate these performance barriers and lack of standardization. Helmsman was initially written to evaluate patterns of variation in massive whole-genome datasets, containing tens or hundreds of millions of SNVs observed in tens of thousands of individuals. Generating the mutation spectra matrix for such data is virtually impossible with R-based implementations, but Helmsman provides a very fast and scalable solution for analyzing these datasets.

If you plan to use the output of Helmsman in other mutation signature analysis packages, Helmsman can automatically generate a small R script with all the code necessary to read the mutation spectra matrix and format it for compatibility with existing tools, using functions from the musigtools package.

Helmsman includes several other convenient features, including:

pooling samples together when generating the mutation spectra matrix
aggregating data from multiple VCF files
bare-bones (and really fast!) non-negative matrix factorization, to extract signatures without relying on external packages
parallel execution
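Pooling samples and aggregating across files both amount to summing count vectors. The sketch below shows one plausible reading of those two features using plain dictionaries of counters. The function names `aggregate` and `pool`, the sample IDs, and the group label are hypothetical and are not taken from Helmsman's code or command-line options.

```python
from collections import Counter, defaultdict


def aggregate(matrices):
    """Sum per-sample counts across several per-file matrices."""
    total = defaultdict(Counter)
    for matrix in matrices:
        for sample, counts in matrix.items():
            total[sample].update(counts)  # Counter.update adds counts
    return total


def pool(matrix, groups):
    """Sum counts of all samples assigned to the same group.

    Samples missing from `groups` keep their own ID as the group name.
    """
    pooled = defaultdict(Counter)
    for sample, counts in matrix.items():
        pooled[groups.get(sample, sample)].update(counts)
    return pooled


# Two hypothetical per-file matrices with overlapping samples.
file_a = {"s1": Counter({"A[C>T]G": 2}), "s2": Counter({"T[T>C]A": 1})}
file_b = {"s1": Counter({"A[C>T]G": 3, "T[T>C]A": 1})}

combined = aggregate([file_a, file_b])
pooled = pool(combined, {"s1": "cohort", "s2": "cohort"})
```

Here `combined` holds one row per sample across both files, and `pooled` collapses both samples into a single "cohort" row.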

Setup

Using Conda (recommended)

The easiest way to start using Helmsman is to create a Conda environment, based on the dependencies specified in the env.yml file:

git clone https://github.com/carjed/helmsman.git
cd helmsman

conda env create -n helmsman -f env.yml
source activate helmsman

Using pip

If you do not have Conda on your system, the prerequisites for Helmsman can also be installed with pip inside of a python3 virtual environment (assuming you have virtualenv installed):

git clone https://github.com/carjed/helmsman.git
cd helmsman

virtualenv -p python3 helmsman_env
source helmsman_env/bin/activate

pip install -r pip_reqs.txt

It is also possible to forgo the virtual environment setup and use pip to install the necessary dependencies in your global site-packages directory, but this is not recommended as doing so may cause dependency conflicts between Helmsman and other programs/packages.

Docker

For more flexible deployment options, Helmsman is available as a Docker container. The following command will pull and run the preconfigured image from the Docker Hub:

# -v maps the directory containing input data
# -p exposes the Jupyter notebook on port 8888
# --NotebookApp.token='' starts the notebook with the token disabled
docker run -d --name helmsman \
  -v /path/to/local/data:/data \
  -p 8888:8888 \
  carjed/helmsman \
  start-notebook.sh --NotebookApp.token=''

You may also clone this repository and build the image locally from the Dockerfile, using the following commands:

git clone https://github.com/carjed/helmsman.git
cd helmsman

docker build -t helmsman --force-rm .

docker run -d --name helmsman \
  -p 8888:8888 \
  helmsman \
  start-notebook.sh --NotebookApp.token=''

Quick Start

Suppose we have a Variant Call Format (VCF) file named input.vcf, containing the genotypes of N individuals at each somatic mutation identified. You will also need the corresponding reference genome. With the following command, Helmsman will parse the VCF file and write a file under /path/to/output/ containing the Nx96 mutation spectra matrix:

python helmsman.py --input /path/to/input.vcf --fastafile /path/to/reference_genome.fasta --projectdir /path/to/output/
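The 96 columns of the matrix correspond to the standard SNV classification used in mutation signature analysis: 6 substitution classes written with a pyrimidine reference base, each combined with 4 possible 5' neighbours and 4 possible 3' neighbours. The sketch below enumerates those categories and shows how a mutation on a purine reference base is folded onto the opposite strand. It is an illustration of the convention only; the label format and the function names `build_categories` and `classify` are assumptions, not Helmsman's internal code or output format.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")
# Six substitution classes, written with a pyrimidine (C or T) reference base.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def build_categories():
    """Return the 96 labels, e.g. 'A[C>T]G' (5' base, substitution, 3' base)."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product(BASES, repeat=2)
    ]


def classify(trinucleotide, alt):
    """Map a reference trinucleotide and alternate base to one of the 96 labels."""
    ref = trinucleotide[1]
    if ref in "AG":
        # Purine reference: describe the mutation on the opposite strand.
        trinucleotide = revcomp(trinucleotide)
        alt = alt.translate(COMPLEMENT)
        ref = trinucleotide[1]
    return f"{trinucleotide[0]}[{ref}>{alt}]{trinucleotide[2]}"


categories = build_categories()
```

For example, a G>A change in the context CGT and a C>T change in the context ACG describe the same event on opposite strands, so `classify` assigns both to the same column.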

Citation

If you use Helmsman in your research, please cite our paper published in BMC Genomics:

Carlson J, Li JZ, Zöllner S. Helmsman: fast and efficient mutation signature analysis for massive sequencing datasets. BMC Genomics. 2018;19:845. doi:10.1186/s12864-018-5264-y

The Helmsman mascot was designed by Robert James Russell—view more of his work at http://www.robertjamesrussell.com/art/ and follow him on Twitter at @robhollywood!
