
bioinform / Somaticseq

Licence: bsd-2-clause
An ensemble approach to accurately detect somatic mutations using SomaticSeq

SomaticSeq

Requirements

The project's Dockerfile lists the dependencies:

  • Python 3, plus pysam, numpy, scipy, pandas, and xgboost libraries.
  • BEDTools: required when parallel processing is invoked, and/or when any bed files are used as input files.
  • At least one of the callers we have incorporated, i.e., MuTect2 (GATK4) / MuTect / Indelocator, VarScan2, JointSNVMix2, SomaticSniper, VarDict, MuSE, LoFreq, Scalpel, Strelka2, TNscope, and/or Platypus. SomaticSeq relies on 3rd-party caller(s) to generate mutation candidates, so you have to run at least one of them, but preferably multiple.
  • Optional: dbSNP VCF file (if you want to use dbSNP membership as a feature).
  • Optional: R and the ada package are required for AdaBoost, whereas XGBoost is implemented in Python.
  • To install SomaticSeq, clone this repo, cd somaticseq, and then run ./setup.py install.

To install from github source with conda

conda create --name my_environment -c bioconda python bedtools
conda activate my_environment
git clone git@github.com:bioinform/somaticseq.git
cd somaticseq
./setup.py install

To install the bioconda version

SomaticSeq is also available on bioconda. To install the bioconda version, which also automatically installs a number of 3rd-party somatic mutation callers: conda install -c bioconda somaticseq.

Test your installation

There are some toy data sets and test scripts in the example directory. They should finish in under 1 minute if SomaticSeq is installed properly.

Run SomaticSeq with an example command

  • At minimum, given the results of the individual mutation caller(s), SomaticSeq will extract sequencing features for the combined call set. Required inputs are --output-directory, --genome-reference, paired|single, --tumor-bam-file, and --normal-bam-file. Everything else is optional (though without a single VCF file from at least one caller, SomaticSeq will have nothing to do).

  • The following four files will be created in the output directory:

    • Consensus.sSNV.vcf, Consensus.sINDEL.vcf, Ensemble.sSNV.tsv, and Ensemble.sINDEL.tsv.
  • If you're searching for pipelines to run those individual somatic mutation callers, feel free to take advantage of our Dockerized Somatic Mutation Workflow.
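
As a quick sanity check after a run, the four output files named above can be looked up in the output directory. The following is a minimal sketch; the helper name missing_outputs is hypothetical and is not part of SomaticSeq.

```python
from pathlib import Path
from typing import List

# File names as listed above for the default (consensus) mode.
EXPECTED_OUTPUTS = (
    "Consensus.sSNV.vcf",
    "Consensus.sINDEL.vcf",
    "Ensemble.sSNV.tsv",
    "Ensemble.sINDEL.tsv",
)


def missing_outputs(output_dir: str) -> List[str]:
    """Return the expected output files that are absent from output_dir."""
    out = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (out / name).is_file()]
```

An empty list means all four files are present; anything else names the files that were not written.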

# Merge caller results and extract SomaticSeq features
somaticseq_parallel.py \
--output-directory  $OUTPUT_DIR \
--genome-reference  GRCh38.fa \
--inclusion-region  genome.bed \
--exclusion-region  blacklist.bed \
--algorithm         xgboost \
--threads           24 \
paired \
--tumor-bam-file    tumor.bam \
--normal-bam-file   matched_normal.bam \
--mutect2-vcf       MuTect2/variants.vcf \
--varscan-snv       VarScan2/variants.snp.vcf \
--varscan-indel     VarScan2/variants.indel.vcf \
--jsm-vcf           JointSNVMix2/variants.snp.vcf \
--somaticsniper-vcf SomaticSniper/variants.snp.vcf \
--vardict-vcf       VarDict/variants.vcf \
--muse-vcf          MuSE/variants.snp.vcf \
--lofreq-snv        LoFreq/variants.snp.vcf \
--lofreq-indel      LoFreq/variants.indel.vcf \
--scalpel-vcf       Scalpel/variants.indel.vcf \
--strelka-snv       Strelka/variants.snv.vcf \
--strelka-indel     Strelka/variants.indel.vcf
  • --inclusion-region or --exclusion-region will require BEDTools in your path.
  • --algorithm defaults to xgboost as of v3.6.0, but can also be set to ada (AdaBoost in R). XGBoost supports multi-threading, can be orders of magnitude faster than AdaBoost, and appears to be about as accurate, which is why the default was changed from ada to xgboost in v3.6.0.
  • To split the job into multiple threads, place --threads X before the paired option to indicate X threads. SomaticSeq creates X BED files (each covering 1/X of the total base pairs), runs on each of those sub-BED files in parallel, and then merges the results. This requires bedtools in your path.
  • For all input VCF files, either .vcf or .vcf.gz are acceptable.
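
The idea behind --threads can be illustrated with a small sketch that divides a list of BED regions into X groups of roughly equal total length. This is only an illustration of the splitting described above, not SomaticSeq's actual code (which relies on BEDTools); the function name split_regions and the in-memory region tuples are assumptions made for the example.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end), BED-style half-open


def split_regions(regions: List[Region], n_chunks: int) -> List[List[Region]]:
    """Split regions into n_chunks lists, each covering about 1/n of the bases."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    total = sum(end - start for _, start, end in regions)
    chunks: List[List[Region]] = [[] for _ in range(n_chunks)]
    # Cumulative base-pair count at which each chunk should end.
    boundaries = [total * (i + 1) // n_chunks for i in range(n_chunks)]
    consumed = 0  # bases assigned so far
    idx = 0       # chunk currently being filled
    for chrom, start, end in regions:
        while start < end:
            # Move on once the current chunk has reached its boundary.
            while idx < n_chunks - 1 and consumed >= boundaries[idx]:
                idx += 1
            if idx < n_chunks - 1:
                take = min(end - start, boundaries[idx] - consumed)
            else:
                take = end - start  # the last chunk takes whatever remains
            chunks[idx].append((chrom, start, start + take))
            start += take
            consumed += take
    return chunks
```

A region that straddles a boundary is cut in two, so each chunk ends up with the same number of base pairs, give or take rounding.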

Additional parameters, to be specified before the paired option, invoke training mode. In addition to the four files listed above, two classifiers (SNV and indel) will be created.

  • --somaticseq-train: FLAG to invoke training mode with no argument, which also requires ground truth VCF files as follows:
  • --truth-snv: if you have a ground truth VCF file for SNV
  • --truth-indel: if you have a ground truth VCF file for INDEL

Additional input files, to be specified before the paired option, invoke prediction mode (which uses classifiers to score variants). Four additional files will be created: SSeq.Classified.sSNV.vcf, SSeq.Classified.sSNV.tsv, SSeq.Classified.sINDEL.vcf, and SSeq.Classified.sINDEL.tsv.

  • --classifier-snv: classifier previously built for SNV
  • --classifier-indel: classifier previously built for INDEL

Without the parameters above to invoke training or prediction mode, SomaticSeq defaults to majority-vote consensus mode.
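
The relationship between these flags and the three modes can be summarized in a short sketch. It is a hypothetical helper for checking a command line before launching a job, not SomaticSeq's own argument parser; the function name infer_mode is an assumption, and only the flags documented above are consulted.

```python
from typing import List


def infer_mode(argv: List[str]) -> str:
    """Return 'training', 'prediction', or 'consensus' for an argument list."""
    # The mode flags are documented as going before the paired/single
    # sub-command, so only that part of the command line is inspected.
    cut = len(argv)
    for sub in ("paired", "single"):
        if sub in argv:
            cut = min(cut, argv.index(sub))
    global_opts = argv[:cut]
    if "--somaticseq-train" in global_opts:
        return "training"
    if "--classifier-snv" in global_opts or "--classifier-indel" in global_opts:
        return "prediction"
    return "consensus"
```

For example, an argument list containing --somaticseq-train before paired yields "training", while one with neither the training flag nor a classifier yields "consensus".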

Do not worry if Python throws the following warning. This occurs when SciPy attempts a statistical test with empty data, e.g., z-scores between reference- and variant-supporting reads will be NaN if there is no reference read at a position.

  RuntimeWarning: invalid value encountered in double_scalars
  z = (s - expected) / np.sqrt(n1*n2*(n1+n2+1)/12.0)
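
To see why the statistic is undefined, the expression in the warning can be evaluated by hand. The sketch below is written in plain Python and assumes the usual rank-sum definition expected = n1*(n1+n2+1)/2, which is not shown in the warning itself; the function name rank_sum_z is hypothetical.

```python
import math


def rank_sum_z(s: float, n1: int, n2: int) -> float:
    """Z-score in the form shown in the warning; NaN when it is undefined."""
    expected = n1 * (n1 + n2 + 1) / 2.0  # assumed definition of `expected`
    denom = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    if denom == 0.0:
        # With an empty group, numerator and denominator are both zero.
        # NumPy evaluates 0/0 to NaN and emits a RuntimeWarning; plain Python
        # would raise ZeroDivisionError, so NaN is returned explicitly here.
        return float("nan")
    return (s - expected) / denom
```

With no reference reads (n1 = 0), the rank sum s is 0 and the result is NaN, which is why the warning is harmless.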

Run SomaticSeq modules separately

Most SomaticSeq modules can be run on their own. They may be useful for debugging, or can be run for your own purposes. See this page for your options.

Dockerized workflows and pipelines

To run somatic mutation callers and then SomaticSeq

We have created a module (i.e., makeSomaticScripts.py) that can run all the dockerized somatic mutation callers and then SomaticSeq, described at somaticseq/utilities/dockered_pipelines. There is also an alignment workflow described there. You need docker to run these workflows. Singularity is also supported, but is not optimized.

To create training data to create SomaticSeq classifiers

We have also dockerized pipelines for in silico mutation spike-in at somaticseq/utilities/dockered_pipelines/bamSimulator. These pipelines are based on BAMSurgeon. We have used them to create training sets for building SomaticSeq classifiers, though they have not been updated for a while.

GATK's best practices for alignment

Described at somaticseq/utilities/dockered_pipelines. The module is makeAlignmentScripts.py.

Additional workflows

Video tutorial

This 8-minute video was created for SomaticSeq v1.0. The details are slightly outdated, but the main points remain the same.

