jostmey / Dkm

Licence: lgpl-3.0
Dynamic Kernel Matching (DKM) for Classifying Data with Non-conforming Features

Modelling Data that Do Not Conform to Rows and Columns: A Case Study of T-cell Receptor Datasets

Publication submitted for peer-review
Jared L Ostmeyer, Assistant Professor, UT Southwestern Department of Population and Data Sciences

Introduction

Statistical classifiers are mathematical models that use example data to find patterns in features that predict a label. Most statistical classifiers assume the features are arranged into rows and columns, like a spreadsheet, but many kinds of data do not conform to this structure. Sequences are one example, which is why sequence data are usually stored in text files rather than spreadsheets. To build statistical classifiers for sequences and other non-conforming features, we have developed a method we call dynamic kernel matching (DKM).

DKM is analogous to max-pooling in a convolutional network, but for sequences instead of image patches. Consider the problem of classifying a sequence. Because some sequences are longer than others, the number of features is irregular. Given a specific sequence, the challenge is to determine the appropriate permutation (pairing) of features with weights, so that the features can be run through the statistical classifier to generate a prediction. We use a sequence alignment algorithm to find the permutation of features that exhibits the maximal response, much as max-pooling finds the image patch that exhibits the maximal response in a convolutional network. Because the number of possible permutations between features and weights is immense, the problem appears computationally intractable, but a sequence alignment algorithm solves it in polynomial time. Here, we implement the Needleman-Wunsch algorithm in TensorFlow. Analogous algorithms exist for (i) sets, (ii) trees, and (iii) graphs, making it possible to use DKM on non-conforming features represented by these structures. Unlike sequence alignment, however, the general problem of graph alignment is considered NP-hard.
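The alignment step described above can be illustrated with a minimal sketch in plain Python. This is not the repository's TensorFlow implementation. The function names, the one-hot encoding over a three-letter alphabet, and the default gap penalty of zero are assumptions made for illustration only. The sketch scores every feature against every weight position, then uses the Needleman-Wunsch recurrence to find the order-preserving pairing with the largest total response.

```python
def one_hot(sequence, alphabet="ABC"):
    """Encode each symbol as a one-hot feature vector (illustrative encoding)."""
    return [[1.0 if symbol == letter else 0.0 for letter in alphabet]
            for symbol in sequence]


def alignment_score(features, weights, gap=0.0):
    """Maximal global-alignment response between features and weights.

    features: list of feature vectors, one per sequence position.
    weights:  list of weight vectors, one per weight position.
    gap:      penalty subtracted whenever a position is skipped (assumed value).
    """
    # sim[i][j] is the response of feature i to weight j (a dot product).
    sim = [[sum(f * w for f, w in zip(feature, weight)) for weight in weights]
           for feature in features]
    n, k = len(features), len(weights)

    # score[i][j] is the best score aligning the first i features
    # with the first j weights.
    score = [[0.0] * (k + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap * i
    for j in range(1, k + 1):
        score[0][j] = -gap * j

    for i in range(1, n + 1):
        for j in range(1, k + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim[i - 1][j - 1],  # pair feature i with weight j
                score[i - 1][j] - gap,                    # skip feature i
                score[i][j - 1] - gap,                    # skip weight j
            )
    return score[n][k]


# Weights that "look for" A followed by C respond fully to the sequence ABCB.
print(alignment_score(one_hot("ABCB"), one_hot("AC")))  # 2.0
# Weights in the opposite order can match only one symbol, because order is preserved.
print(alignment_score(one_hot("ABCB"), one_hot("CA")))  # 1.0
```

The double loop runs in time proportional to the sequence length times the number of weight positions, which is the polynomial cost referred to above. In a classifier, the returned score would typically be passed through a bias and a nonlinearity to produce a prediction; that step is omitted here.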

To illustrate the types of non-conforming features that we can handle with DKM, we consider two datasets of T-cell receptors, which we anticipate contain signatures for diagnosing disease.

Antigen Classification Problem

10x Genomics has published a dataset of sequenced T-cell receptors, each labelled by the disease particle, called an antigen, that it interacts with. We refer to this as the antigen classification problem. To solve the antigen classification problem, we use DKM to classify each sequence in this dataset. See the folder antigen-classification-problem for details.

Repertoire Classification Problem

Adaptive Biotechnologies has published a separate dataset in which each patient's sequenced T-cell receptors, collectively called an immune repertoire, are labelled by that patient's cytomegalovirus (CMV) serostatus. We refer to this as the repertoire classification problem. To solve the repertoire classification problem, we use DKM to classify each set of sequences in this dataset. See the folder repertoire-classification-problem for details.

Study Design

The training cohort is used to fit a model, the validation cohort is used for model selection, and the test cohort is used only for reporting results. We strictly adhere to this protocol, which avoids model selection bias in the reported results.
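The protocol can be sketched in a few lines of Python. The toy threshold classifier, the `offset` hyperparameter, and the function names below are hypothetical stand-ins chosen for illustration; they are not part of this repository. The point is only the order of operations: fit on the training cohort, choose among candidates on the validation cohort, and touch the test cohort once.

```python
def fit(train, offset):
    """Toy stand-in for training: a threshold midway between class means, shifted by `offset`."""
    positives = [x for x, y in train if y == 1]
    negatives = [x for x, y in train if y == 0]
    midpoint = (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2
    threshold = midpoint + offset
    return lambda x: int(x > threshold)


def accuracy(model, cohort):
    """Fraction of (feature, label) pairs in the cohort that the model classifies correctly."""
    return sum(model(x) == y for x, y in cohort) / len(cohort)


def run_study(train, validation, test, offsets):
    # 1. Fit one candidate model per hyperparameter setting on the training cohort.
    models = [fit(train, offset) for offset in offsets]
    # 2. Select the candidate that performs best on the validation cohort.
    best = max(models, key=lambda model: accuracy(model, validation))
    # 3. Report the selected model's performance on the test cohort, used only here.
    return accuracy(best, test)


train = [(0.0, 0), (1.0, 0), (3.0, 1), (4.0, 1)]
validation = [(0.8, 0), (1.5, 0), (2.5, 1), (3.2, 1)]
test = [(0.2, 0), (1.9, 0), (2.1, 1), (5.0, 1)]
print(run_study(train, validation, test, offsets=[-1.5, 0.0, 1.5]))  # 1.0
```

Because the test cohort plays no part in choosing among the candidates, the reported number is not inflated by the selection step.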

Requirements

Recommended Tools

Download

  • Download: zip
  • Git: git clone https://github.com/jostmey/dkm