bioinform / Somaticseq

Licence: bsd-2-clause

An ensemble approach to accurately detect somatic mutations using SomaticSeq

SomaticSeq

SomaticSeq is an ensemble caller that has the ability to use machine learning to filter out false positives. The detailed documentation is included in the repo, located in docs/Manual.pdf.
SomaticSeq's open-access paper: Fang LT, Afshar PT, Chhibber A, et al. An ensemble approach to accurately detect somatic mutations using SomaticSeq. Genome Biol. 2015;16:197.
Feel free to report issues and/or ask questions at the Issues page.
The v2 branch is still supported, but it is severely limited compared to the current versions.

Requirements

This dockerfile reveals the dependencies

Python 3, plus pysam, numpy, scipy, pandas, and xgboost libraries.
BEDTools: required when parallel processing is invoked, and/or when any bed files are used as input files.
At least one of the callers we have incorporated, i.e., MuTect2 (GATK4) / MuTect / Indelocator, VarScan2, JointSNVMix2, SomaticSniper, VarDict, MuSE, LoFreq, Scalpel, Strelka2, TNscope, and/or Platypus. SomaticSeq relies on 3rd-party caller(s) to generate mutation candidates, so you have to run at least one of them, but preferably multiple.
Optional: dbSNP VCF file (if you want to use dbSNP membership as a feature).
Optional: R and ada are required for AdaBoost, whereas XGBoost is implemented in python.
To install SomaticSeq, clone this repo, cd somaticseq, and then run ./setup.py install.

To install from github source with conda

conda create --name my_environment -c bioconda python bedtools
conda activate my_environment
git clone git@github.com:bioinform/somaticseq.git
cd somaticseq
./setup.py install

To install the bioconda version

SomaticSeq can also be found on bioconda. To install the bioconda version, which also automatically installs a number of 3rd-party somatic mutation callers, run: conda install -c bioconda somaticseq.

Test your installation

There are some toy data sets and test scripts in example. The test scripts should finish in <1 minute if SomaticSeq is installed properly.

Run SomaticSeq with an example command

At minimum, given the results of the individual mutation caller(s), SomaticSeq will extract sequencing features for the combined call set. Required inputs are --output-directory, --genome-reference, paired|single, --tumor-bam-file, and --normal-bam-file. Everything else is optional (though without a single VCF file from at least one caller, SomaticSeq will have nothing to do).
The following four files will be created in the output directory:
- Consensus.sSNV.vcf, Consensus.sINDEL.vcf, Ensemble.sSNV.tsv, and Ensemble.sINDEL.tsv.
If you're searching for pipelines to run those individual somatic mutation callers, feel free to take advantage of our Dockerized Somatic Mutation Workflow.

# Merge caller results and extract SomaticSeq features
somaticseq_parallel.py \
--output-directory  $OUTPUT_DIR \
--genome-reference  GRCh38.fa \
--inclusion-region  genome.bed \
--exclusion-region  blacklist.bed \
--algorithm         xgboost \
--threads           24 \
paired \
--tumor-bam-file    tumor.bam \
--normal-bam-file   matched_normal.bam \
--mutect2-vcf       MuTect2/variants.vcf \
--varscan-snv       VarScan2/variants.snp.vcf \
--varscan-indel     VarScan2/variants.indel.vcf \
--jsm-vcf           JointSNVMix2/variants.snp.vcf \
--somaticsniper-vcf SomaticSniper/variants.snp.vcf \
--vardict-vcf       VarDict/variants.vcf \
--muse-vcf          MuSE/variants.snp.vcf \
--lofreq-snv        LoFreq/variants.snp.vcf \
--lofreq-indel      LoFreq/variants.indel.vcf \
--scalpel-vcf       Scalpel/variants.indel.vcf \
--strelka-snv       Strelka/variants.snv.vcf \
--strelka-indel     Strelka/variants.indel.vcf
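
If you drive SomaticSeq from a Python script, the command above can be assembled programmatically. The following is a minimal sketch, not part of SomaticSeq: the function name build_somaticseq_command and its parameters are hypothetical, and it only uses flags that appear in the command above. It places global options before the paired token and skips callers that were not run.

```python
import shlex


def build_somaticseq_command(output_dir, reference, tumor_bam, normal_bam,
                             caller_vcfs, global_options=None):
    """Assemble an argument list for somaticseq_parallel.py in paired mode.

    Hypothetical helper for illustration; not part of SomaticSeq itself.
    caller_vcfs maps a caller flag (e.g. "--mutect2-vcf") to a path or None.
    global_options maps a flag to a value, or to None for a bare flag.
    """
    cmd = [
        "somaticseq_parallel.py",
        "--output-directory", output_dir,
        "--genome-reference", reference,
    ]
    # Global options (e.g. --threads, --algorithm) go before the `paired` token.
    for flag, value in (global_options or {}).items():
        cmd.append(flag)
        if value is not None:
            cmd.append(str(value))
    cmd += ["paired",
            "--tumor-bam-file", tumor_bam,
            "--normal-bam-file", normal_bam]
    # Only pass VCF files for callers that were actually run.
    for flag, path in caller_vcfs.items():
        if path:
            cmd += [flag, path]
    return cmd


example = build_somaticseq_command(
    "somaticseq_out", "GRCh38.fa", "tumor.bam", "matched_normal.bam",
    caller_vcfs={"--mutect2-vcf": "MuTect2/variants.vcf", "--muse-vcf": None},
    global_options={"--threads": 4, "--algorithm": "xgboost"},
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in example))
```

The returned list can be passed to subprocess.run once SomaticSeq is installed and on your path.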

--inclusion-region or --exclusion-region will require BEDTools in your path.
--algorithm defaults to xgboost as of v3.6.0, but can also be ada (AdaBoost in R). XGBoost supports multi-threading and can be orders of magnitude faster than AdaBoost, with about the same accuracy, so we changed the default from ada to xgboost in v3.6.0.
To split the job into multiple threads, place --threads X before the paired option to indicate X threads. It simply creates multiple BED files (each consisting of 1/X of the total base pairs), runs SomaticSeq on each of those sub-BED files in parallel, and then merges the results. This requires bedtools in your path.
For all input VCF files, either .vcf or .vcf.gz are acceptable.
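
To illustrate the idea behind --threads, here is a minimal sketch of splitting a set of BED regions into X parts of roughly equal base pairs. It is not SomaticSeq's actual implementation (which relies on BEDTools); the function name split_bed and the in-memory tuple representation are assumptions made for the example.

```python
def split_bed(regions, n_parts):
    """Split (chrom, start, end) regions into at most n_parts lists,
    each covering about 1/n_parts of the total base pairs.

    Illustrative sketch only; regions are assumed sorted and non-overlapping.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    total = sum(end - start for _, start, end in regions)
    # Ceiling division, so n_parts chunks are always enough to cover everything.
    target = -(-total // n_parts)
    parts = [[]]
    room = target  # base pairs still available in the current part
    for chrom, start, end in regions:
        while start < end:
            if room == 0:
                # Current part is full: start a new one.
                parts.append([])
                room = target
            take = min(room, end - start)
            parts[-1].append((chrom, start, start + take))
            start += take
            room -= take
    return parts


for i, part in enumerate(split_bed([("chr1", 0, 100), ("chr2", 0, 50)], 3), 1):
    print(i, part)
```

A region that straddles a chunk boundary is cut in two, so every part except possibly the last covers the same number of base pairs.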

To invoke training mode, specify the following additional parameters before the paired option. In addition to the four files listed above, two classifiers (SNV and indel) will be created.

--somaticseq-train: FLAG to invoke training mode with no argument, which also requires ground truth VCF files as follows:
--truth-snv: if you have a ground truth VCF file for SNV
--truth-indel: if you have a ground truth VCF file for INDEL

To invoke prediction mode (i.e., to use classifiers to score variants), specify the following additional input files before the paired option. Four additional files will be created, i.e., SSeq.Classified.sSNV.vcf, SSeq.Classified.sSNV.tsv, SSeq.Classified.sINDEL.vcf, and SSeq.Classified.sINDEL.tsv.

--classifier-snv: classifier previously built for SNV
--classifier-indel: classifier previously built for INDEL

Without the parameters above to invoke training or prediction mode, SomaticSeq will default to majority-vote consensus mode.
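
The three modes can be summarized with a small sketch that inspects a list of command-line arguments and reports the mode and the output files to expect. This is an illustration of the rules described above, not SomaticSeq code: the function names are hypothetical, and two behaviors are assumptions, namely that training takes precedence if both kinds of flags are given, and that each classifier flag yields only the classified files for its own variant type. Classifier file names produced in training mode are not listed because they are not stated here.

```python
BASE_OUTPUTS = [
    "Consensus.sSNV.vcf", "Consensus.sINDEL.vcf",
    "Ensemble.sSNV.tsv", "Ensemble.sINDEL.tsv",
]


def infer_mode(args):
    """Return 'training', 'prediction', or 'consensus' for a list of CLI args."""
    if "--somaticseq-train" in args:
        return "training"  # assumption: training wins if both kinds are present
    if "--classifier-snv" in args or "--classifier-indel" in args:
        return "prediction"
    return "consensus"


def expected_outputs(args):
    """Return (mode, list of output file names expected in the output directory)."""
    mode = infer_mode(args)
    outputs = list(BASE_OUTPUTS)
    if mode == "prediction":
        # Assumption: classified files appear only for the classifier(s) supplied.
        if "--classifier-snv" in args:
            outputs += ["SSeq.Classified.sSNV.vcf", "SSeq.Classified.sSNV.tsv"]
        if "--classifier-indel" in args:
            outputs += ["SSeq.Classified.sINDEL.vcf", "SSeq.Classified.sINDEL.tsv"]
    return mode, outputs


print(expected_outputs(["--threads", "4", "paired"]))
```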

Do not worry if Python throws the following warning. This occurs when SciPy attempts a statistical test with empty data, e.g., z-scores between reference- and variant-supporting reads will be NaN if there is no reference read at a position.

  RuntimeWarning: invalid value encountered in double_scalars
  z = (s - expected) / np.sqrt(n1*n2*(n1+n2+1)/12.0)
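
To see why the result is NaN, consider the following sketch of the z-score from the warning, written with the standard library only. The function name rank_sum_z is hypothetical, and the expression for expected is the usual rank-sum expectation, which is an assumption since only the last line of the computation is shown in the warning. When either group is empty, both the numerator and the denominator are zero; NumPy evaluates 0/0, emits the warning, and returns NaN, which the sketch mimics explicitly.

```python
import math


def rank_sum_z(s, n1, n2):
    """z-score of a rank sum s for groups of size n1 and n2.

    Illustrative sketch: mirrors the expression in the warning, but returns
    NaN explicitly where NumPy would compute 0/0 and warn.
    """
    # Expected rank sum of the first group under the null hypothesis.
    expected = n1 * (n1 + n2 + 1) / 2.0
    denominator = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    if denominator == 0.0:
        # Happens when n1 == 0 or n2 == 0, e.g. no reference-supporting reads.
        return math.nan
    return (s - expected) / denominator


print(rank_sum_z(s=0, n1=0, n2=5))  # empty first group
print(rank_sum_z(s=3, n1=2, n2=2))  # both groups non-empty
```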

Run SomaticSeq modules separately

Most SomaticSeq modules can be run on their own. They may be useful for debugging, or for your own purposes. See this page for your options.

Dockerized workflows and pipelines

To run somatic mutation callers and then SomaticSeq

We have created a module (i.e., makeSomaticScripts.py) that can run all the dockerized somatic mutation callers and then SomaticSeq, described at somaticseq/utilities/dockered_pipelines. There is also an alignment workflow described there. You need docker to run these workflows. Singularity is also supported, but is not optimized.

To create training data to create SomaticSeq classifiers

We have also dockerized pipelines for in silico mutation spike-in at somaticseq/utilities/dockered_pipelines/bamSimulator. These pipelines are based on BAMSurgeon. We have used them to create training sets to build SomaticSeq classifiers, though they have not been updated for a while.

GATK's best practices for alignment

Described at somaticseq/utilities/dockered_pipelines. The module is makeAlignmentScripts.py.

Additional workflows

A Snakemake workflow to run the somatic mutation callers and SomaticSeq was created by Afif Elghraoui at somaticseq/utilities/snakemake. It needs to be updated to work.

Video tutorial

This 8-minute video was created for SomaticSeq v1.0. The details are slightly outdated, but the main points remain the same.
