SystemsGenetics / gene-oracle

Licence: MIT License
Feature extraction algorithm for genomic data

Gene Oracle

This repository contains the code for the Gene Oracle project. Gene Oracle is an ongoing research effort to discover biomarker genes using gene expression data. Gene Oracle identifies gene sets which provide the most predictive power, based on how well they classify samples in a gene expression dataset.

For more information, refer to the paper: Uncovering biomarker genes with enriched classification potential from Hallmark gene sets

Installation

All of Gene Oracle's dependencies can be installed via Anaconda. On a shared system (such as a university research cluster), it is recommended that you install everything in an Anaconda environment:

conda env create -f environment.yml

You must then "activate" your environment in order to use it:

conda activate gene-oracle

# use gene-oracle

conda deactivate

After that, simply clone this repository to use Gene Oracle.

git clone https://github.com/SystemsGenetics/gene-oracle.git

# run the example
cd gene-oracle
scripts/run-example.sh

Usage

Gene Oracle consists of two phases: (1) gene set analysis and (2) gene subset analysis. Each phase involves multiple scripts that are run in sequence. The easiest way to learn how to run these scripts, and to see the input and output data involved, is to run the example script as shown above. It demonstrates how to run Gene Oracle on synthetic input data generated by make-inputs.py.

Model Configuration

Gene Oracle can use any classifier provided by scikit-learn, as well as a custom neural network (implemented in TensorFlow), to evaluate gene sets. Several classifiers are defined with sensible default parameters in models.json. Consult the scikit-learn documentation to see the list of parameters for each classifier. The example run script uses a linear model, which is one of the simplest classifiers available. Other models such as the neural network or random forest may perform better but will take longer to train.

NOTE: A GPU is required only when using the mlp-tf model with tensorflow-gpu. A GPU might not provide a significant speedup over a multicore CPU when training many small neural networks.

Input Data

Gene Oracle takes three primary inputs: (1) a gene expression matrix (GEM), (2) a list of sample labels, and (3) a list of gene sets. These inputs are described below.

The gene expression matrix should be a plaintext file with rows being samples and columns being genes (features). Values in each row should be separated by tabs.

	Gene1	Gene2	Gene3	Gene4
Sample1	0.523	0.991	0.421	0.829
Sample2	8.891	7.673	3.333	9.103
Sample3	4.444	5.551	6.102	0.013
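As a rough illustration of this layout, the sketch below parses a tab-separated GEM using only the Python standard library. The function name `read_gem` is hypothetical and is not part of Gene Oracle; the sketch also assumes the header row begins with an empty cell above the sample names, as in the example above.

```python
import csv
import io


def read_gem(handle):
    """Parse a tab-separated GEM: a header row of gene names, then one row per sample.

    Hypothetical helper for illustration; not part of Gene Oracle.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Assumption: the first header cell sits above the sample-name column
    # and is ignored (it is empty in the example layout).
    genes = header[1:]
    samples, values = [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(genes) + 1:
            raise ValueError(
                f"row for {row[0]!r} has {len(row) - 1} values, expected {len(genes)}"
            )
        samples.append(row[0])
        values.append([float(x) for x in row[1:]])
    return samples, genes, values


example = (
    "\tGene1\tGene2\tGene3\tGene4\n"
    "Sample1\t0.523\t0.991\t0.421\t0.829\n"
    "Sample2\t8.891\t7.673\t3.333\t9.103\n"
    "Sample3\t4.444\t5.551\t6.102\t0.013\n"
)
samples, genes, values = read_gem(io.StringIO(example))
print(samples)       # ['Sample1', 'Sample2', 'Sample3']
print(genes)         # ['Gene1', 'Gene2', 'Gene3', 'Gene4']
print(values[0][1])  # 0.991
```

The row-length check is a cheap way to catch a matrix that is arranged with genes as rows before any training starts.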

For large GEM files, it is recommended that you convert the GEM to numpy format using convert.py from the GEMprep repo, as Gene Oracle can load this binary format much more quickly than the plaintext format. The convert.py script can also transpose your GEM if it is arranged the wrong way:

bin/convert.py GEM.emx.txt GEM.emx.npy --transpose

This example will create three files: GEM.emx.npy, GEM.emx.rownames.txt, and GEM.emx.colnames.txt. The latter two files contain the row names and column names, respectively. Make sure that the rows are samples and the columns are genes!

The label file should contain a label for each sample, corresponding to something such as a condition or phenotype state for the sample. This file should contain two columns, the first being the sample names and the second being the labels. Values in each row should be separated by tabs.

Sample1	Label1
Sample2	Label2
Sample3	Label3
Sample4	Label4
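A small consistency check between the label file and the GEM can catch mismatched sample names early. The following is a minimal sketch under the two-column layout described above; `read_labels` and `check_labels` are hypothetical names, not functions provided by Gene Oracle.

```python
import csv
import io


def read_labels(handle):
    """Parse a two-column, tab-separated label file into {sample: label}."""
    labels = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row!r}")
        sample, label = row
        if sample in labels:
            raise ValueError(f"duplicate sample name: {sample!r}")
        labels[sample] = label
    return labels


def check_labels(gem_samples, labels):
    """Return (GEM samples without a label, labelled samples missing from the GEM)."""
    gem_set = set(gem_samples)
    unlabelled = [s for s in gem_samples if s not in labels]
    extra = [s for s in labels if s not in gem_set]
    return unlabelled, extra


label_text = "Sample1\tLabel1\nSample2\tLabel2\nSample3\tLabel3\nSample4\tLabel4\n"
labels = read_labels(io.StringIO(label_text))
print(check_labels(["Sample1", "Sample2", "Sample3"], labels))
# ([], ['Sample4'])
```

How Gene Oracle itself treats unmatched samples is not specified here, so this check only reports them.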

The gene set list should contain the name and genes for a gene set on each line, similar to the GMT format. The gene names should be identical to those used in the GEM file. Values on each row should be separated by tabs.

GeneSet1	Gene1	Gene2	Gene3
GeneSet2	Gene2	Gene4	Gene5	Gene6
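The sketch below reads a gene set list in this layout and reports genes that do not appear in the GEM. It assumes the first field on each line is the set name and every remaining field is a gene; note that the standard GMT format also carries a description in the second column, which this layout omits. `read_gene_sets` and `missing_genes` are hypothetical names used only for illustration.

```python
import io


def read_gene_sets(handle):
    """Parse lines of the form: name<TAB>gene<TAB>gene... into {name: [genes]}."""
    gene_sets = {}
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\r\n").split("\t")
        name = fields[0]
        genes = [g for g in fields[1:] if g]  # drop empty trailing fields
        gene_sets[name] = genes
    return gene_sets


def missing_genes(gene_sets, gem_genes):
    """Map each gene set name to the genes it lists that are absent from the GEM."""
    known = set(gem_genes)
    report = {}
    for name, genes in gene_sets.items():
        absent = [g for g in genes if g not in known]
        if absent:
            report[name] = absent
    return report


text = "GeneSet1\tGene1\tGene2\tGene3\nGeneSet2\tGene2\tGene4\tGene5\tGene6\n"
gene_sets = read_gene_sets(io.StringIO(text))
print(missing_genes(gene_sets, ["Gene1", "Gene2", "Gene3", "Gene4"]))
# {'GeneSet2': ['Gene5', 'Gene6']}
```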

Phase 1: Gene Set Analysis

The script phase1-evaluate.py takes a list of gene sets and evaluates each gene set by training and evaluating a classifier on the input dataset with only the genes in the set. This script can also evaluate the entire set of genes in the input dataset, as well as random gene sets.
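The data preparation behind this step can be sketched as follows: restrict the expression matrix to the columns of one gene set, and draw a random gene set of the same size for comparison. This is an illustrative sketch only, not the code in phase1-evaluate.py; the helper names are hypothetical, and the classifier training itself is left out.

```python
import random


def select_genes(genes, values, gene_set):
    """Keep only the columns of `values` whose gene appears in `gene_set`.

    `genes` lists the column names of `values` (rows are samples).
    Genes in the set that are absent from the matrix are silently skipped.
    """
    wanted = set(gene_set)
    idx = [i for i, g in enumerate(genes) if g in wanted]
    sub_genes = [genes[i] for i in idx]
    sub_values = [[row[i] for i in idx] for row in values]
    return sub_genes, sub_values


def random_gene_set(genes, size, rng):
    """Draw `size` distinct genes uniformly at random from the matrix columns."""
    return rng.sample(genes, size)


genes = ["Gene1", "Gene2", "Gene3", "Gene4"]
values = [
    [0.523, 0.991, 0.421, 0.829],
    [8.891, 7.673, 3.333, 9.103],
]
sub_genes, sub_values = select_genes(genes, values, ["Gene1", "Gene3"])
print(sub_genes)   # ['Gene1', 'Gene3']
print(sub_values)  # [[0.523, 0.421], [8.891, 3.333]]

rng = random.Random(0)  # fixed seed so the random sets are reproducible
print(random_gene_set(genes, 2, rng))
```

A classifier would then be trained and scored on `sub_values` together with the sample labels.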

The script phase1-select.py takes evaluation results for gene sets and compares them to results for random sets of equal size. It uses Student's t-test to determine the statistical significance of a gene set's score as compared to a null distribution for the given set size. Larger gene sets tend to yield higher classification accuracies, so the t-test is used to eliminate this bias when selecting gene sets for subset analysis.
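To make the comparison concrete, here is a minimal sketch of a two-sample Student's t statistic with pooled variance, written with the standard library. It assumes several scores are available both for the gene set and for random sets of the same size; the exact form of the test in phase1-select.py may differ, and `student_t` is a hypothetical helper. The scores in the usage example are invented for illustration.

```python
import math
import statistics


def student_t(a, b):
    """Two-sample Student's t statistic (pooled variance) and degrees of freedom."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two scores")
    va, vb = statistics.variance(a), statistics.variance(b)
    df = na + nb - 2
    pooled = ((na - 1) * va + (nb - 1) * vb) / df
    if pooled == 0:
        raise ValueError("both samples have zero variance")
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    t = (statistics.mean(a) - statistics.mean(b)) / se
    return t, df


# Invented accuracies: one gene set versus random sets of equal size.
gene_set_scores = [0.91, 0.93, 0.92, 0.94]
random_set_scores = [0.85, 0.88, 0.86, 0.87]
t, df = student_t(gene_set_scores, random_set_scores)
print(f"t = {t:.3f}, df = {df}")
```

A positive t means the gene set scored higher on average than the random sets. Converting t to a p-value requires the t distribution, which the standard library does not provide; `scipy.stats.ttest_ind` performs the same pooled-variance test and returns the p-value directly.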

Phase 2: Gene Subset Analysis

The script phase2-evaluate.py takes a list of gene sets and evaluates subsets of each gene set in order to determine the most salient genes in the gene set. This script can also analyze random gene sets in the same manner.

The script phase2-select.py takes evaluation results for the subsets selected by the previous script, measures the saliency of each gene by how frequently it appeared in all subsets, and separates "candidate" genes from "non-candidate" genes according to a threshold.
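The frequency-based selection can be sketched as below. The sketch assumes saliency is the fraction of selected subsets that contain a gene and that a gene is a candidate when this fraction meets a fixed threshold; the definitions used by phase2-select.py may differ, and the function names are hypothetical.

```python
from collections import Counter


def gene_frequencies(subsets):
    """Fraction of subsets in which each gene appears (each subset counts once)."""
    if not subsets:
        raise ValueError("no subsets given")
    counts = Counter(g for subset in subsets for g in set(subset))
    return {g: c / len(subsets) for g, c in counts.items()}


def split_candidates(freqs, threshold):
    """Separate genes at or above the threshold from those below it."""
    candidates = sorted(g for g, f in freqs.items() if f >= threshold)
    non_candidates = sorted(g for g, f in freqs.items() if f < threshold)
    return candidates, non_candidates


subsets = [
    ["Gene1", "Gene2"],
    ["Gene1", "Gene3"],
    ["Gene1", "Gene2", "Gene4"],
    ["Gene2", "Gene3"],
]
freqs = gene_frequencies(subsets)
print(freqs["Gene1"], freqs["Gene4"])  # 0.75 0.25
print(split_candidates(freqs, 0.5))
# (['Gene1', 'Gene2', 'Gene3'], ['Gene4'])
```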

Nextflow

Dependencies

This repository also provides a Nextflow pipeline for running Gene Oracle. All you need is Nextflow, Docker, and nvidia-docker. On HPC systems, you can use Singularity in lieu of Docker. If you cannot use either container runtime, you will have to install Gene Oracle and its dependencies on your local machine.

Input Data

The Nextflow pipeline assumes you have your input data arranged as follows:

input/
  {dataset1}.emx.txt
  {dataset1}.labels.txt
  {dataset2}.emx.txt
  {dataset2}.labels.txt
  ...
  {genesets1}.genesets.txt
  {genesets2}.genesets.txt
  ...

This way, you can provide as many gene set lists and datasets as you like, and the pipeline will process all of them in a single run.
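Before launching a run, it can help to confirm that the input directory follows this naming scheme. The sketch below is a hypothetical pre-flight check and not part of the pipeline: it pairs each `*.emx.txt` file with the matching `*.labels.txt` file and lists the `*.genesets.txt` files. How the pipeline itself matches files is defined by the Nextflow script and nextflow.config, not by this code.

```python
from pathlib import Path


def discover_inputs(input_dir):
    """Return (datasets, gene_set_files, datasets_missing_labels) for a directory.

    `datasets` maps a dataset name to its (emx_path, labels_path) pair.
    """
    input_dir = Path(input_dir)
    datasets, missing_labels = {}, []
    if not input_dir.is_dir():
        return datasets, [], missing_labels
    for emx in sorted(input_dir.glob("*.emx.txt")):
        name = emx.name[: -len(".emx.txt")]
        labels = input_dir / f"{name}.labels.txt"
        if labels.exists():
            datasets[name] = (emx, labels)
        else:
            missing_labels.append(name)
    gene_set_files = sorted(input_dir.glob("*.genesets.txt"))
    return datasets, gene_set_files, missing_labels


datasets, gene_set_files, missing_labels = discover_inputs("input")
print("datasets:", sorted(datasets))
print("gene set files:", [p.name for p in gene_set_files])
print("datasets without labels:", missing_labels)
```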

Usage

Here is a basic usage example:

nextflow run systemsgenetics/gene-oracle -profile <conda|docker|singularity>

This example will download this pipeline to your machine and use the default nextflow.config in this repo. It will assume that you have Gene Oracle installed natively, and it will process all input files in the input directory, saving all output files to the output directory, as defined in nextflow.config.

You can also create your own nextflow.config file; Nextflow will check for a config file in your current directory before defaulting to the config file in this repo. You will most likely need to customize this config file, as it provides options such as which experiments to run, how many chunks to use where applicable, and various other command-line parameters for Gene Oracle. The config file also allows you to define your own "profiles" for running this pipeline in different environments. Consult the Nextflow documentation for more information on what environments are supported.

You can resume a failed run with the -resume flag. Consult the Nextflow documentation for more information on this and other options.

Kubernetes

You can run this pipeline, as well as any other Nextflow pipeline, on a Kubernetes cluster with minimal effort. Consult the kube-runner repo for a command-line approach and Nextflow-API for a browser-based approach.
