

greenelab / Multi Plier

Licence: other

An unsupervised transfer learning approach for rare disease transcriptomics

Labels

html machine-learning dataset analysis methodology

Projects that are alternatives to or similar to Multi Plier

snorkeling

Extracting biomedical relationships from literature with Snorkel 🏊

Stars: ✭ 56 (+69.7%)

Mutual labels: analysis, dataset, methodology

Daps

Denoising Autoencoders for Phenotype Stratification

Stars: ✭ 39 (+18.18%)

Mutual labels: analysis, methodology

Pancancer

Building classifiers using cancer transcriptomes across 33 different cancer-types

Stars: ✭ 84 (+154.55%)

Mutual labels: analysis, methodology

shared-latent-space

Shared Latent Space VAE's

Stars: ✭ 15 (-54.55%)

Mutual labels: analysis, methodology

Feversymmetric

Symmetric evaluation set based on the FEVER (fact verification) dataset

Stars: ✭ 29 (-12.12%)

Mutual labels: dataset

Nba api

An API Client package to access the APIs for NBA.com

Stars: ✭ 881 (+2569.7%)

Mutual labels: analysis

Recruit

The goal of this project is to aggregate job recruitment postings and apply some processing to them.

Stars: ✭ 13 (-60.61%)

Mutual labels: analysis

Company Names Corpus

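Company name corpus. Organization name corpus. Company short names, abbreviations, brand terms, and enterprise names. Can be used for Chinese word segmentation and organization named-entity recognition.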

Stars: ✭ 868 (+2530.3%)

Mutual labels: dataset

Rstudioconf tweets

🖥 A repository for tracking tweets about rstudio::conf

Stars: ✭ 32 (-3.03%)

Mutual labels: dataset

Universityrecruitment Ssurvey

Using serious data to answer the question "What kinds of companies recruit at what kinds of universities?"

Stars: ✭ 30 (-9.09%)

Mutual labels: analysis

Jsut Lab

HTS-style full-context labels for JSUT v1.1

Stars: ✭ 28 (-15.15%)

Mutual labels: dataset

Covid 19 Api

Covid-19 Virus Data API from Johns Hopkins CSSE

Stars: ✭ 15 (-54.55%)

Mutual labels: dataset

Camoco

Camoco is a fully-fledged software package for building co-expression networks and analyzing the overlap interactions among genes.

Stars: ✭ 29 (-12.12%)

Mutual labels: analysis

Tedsds

Apache Spark - Turbofan Engine Degradation Simulation Data Set example in Apache Spark

Stars: ✭ 14 (-57.58%)

Mutual labels: dataset

Elastic data

Elasticsearch datasets ready for bulk loading

Stars: ✭ 30 (-9.09%)

Mutual labels: dataset

Designcourse

Course materials for "Research Design in Political Science"

Stars: ✭ 12 (-63.64%)

Mutual labels: methodology

Awesome Ai In Finance

🔬 A curated list of awesome machine learning strategies & tools in financial market.

Stars: ✭ 910 (+2657.58%)

Mutual labels: analysis

Day night dataset list

Collecting a list of dataset with day and night annotations

Stars: ✭ 30 (-9.09%)

Mutual labels: dataset

Microstate Eeglab Toolbox

Microstate EEGlab toolbox

Stars: ✭ 21 (-36.36%)

Mutual labels: analysis

Dotnet Assembly Grapher

Reverse engineering and software quality assurance tool for .NET assemblies

Stars: ✭ 21 (-36.36%)

Mutual labels: analysis


MultiPLIER

An unsupervised transfer learning approach for rare disease transcriptomics

Taroni JN, Grayson PC, Hu Q, Eddy S, Kretzler M, Merkel PA, and Greene CS⁺. MultiPLIER: a transfer learning framework reveals systemic features of rare autoimmune disease. bioRxiv. 2018.

⁺Correspondence via issues or to [email protected]

Data

Data used in this analysis repo were processed in greenelab/rheum-plier-data. Please see that repository for relevant citations.

Data and code, including items that are too large to be stored with Git LFS (e.g., some models), associated with v0.2.0 are available at the following DOI: 10.6084/m9.figshare.6982919.v2

Dependencies

We have prepared a Docker image that contains all the dependencies required to reproduce these analyses. See docker/Dockerfile for more information about dependencies.

After installing Docker (Docker documentation), the image can be obtained:

docker pull jtaroni/multi-plier:0.2.0

We use R notebooks for analysis, which can be run and modified using RStudio. RStudio is included on our Docker image. This guide from Andrew Heiss, specifically the Run locally with a GUI section, is a great starting point for working with RStudio and Docker.

Overview

Unsupervised machine learning methods provide a promising means to analyze and interpret large datasets. However, most datasets generated by individual researchers remain too small to fully benefit from these methods. In the case of rare diseases, there may be too few cases available, even when multiple studies are combined. We sought to determine whether machine learning models could be constructed from large public data compendia and then transferred to small datasets for subsequent analysis. We trained models using Pathway Level Information ExtractoR (PLIER) (GitHub) over datasets of different types and scales. Models constructed from large public datasets i) were more detailed than those constructed from individual datasets; ii) included features that aligned well to important biological factors; and iii) were transferable to rare disease datasets, where they describe biological processes related to disease severity more effectively than models trained within those datasets.

We call this approach MultiPLIER because we train on multiple datasets, tissues, and biological conditions.
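The transfer step described above can be illustrated with a small sketch. It assumes that "transferring" a trained model means projecting a new expression matrix Y (genes x samples) onto the learned loadings Z (genes x latent variables) with a ridge-regularized least-squares solve, B = (Z^T Z + ridge * I)^-1 Z^T Y. That formula, the `ridge` value, and the function names `project`, `solve`, `matmul`, and `transpose` are assumptions made for illustration; they are not taken from this README, and the repository's own R code should be treated as authoritative.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; try a larger ridge value")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def project(z, y, ridge):
    """Hypothetical projection: B = (Z^T Z + ridge * I)^-1 Z^T Y.

    z: loadings, genes x latent variables
    y: new expression data, genes x samples (same gene order as z)
    Returns latent variable values, latent variables x samples.
    """
    zt = transpose(z)
    gram = matmul(zt, z)
    for i in range(len(gram)):
        gram[i][i] += ridge
    return solve(gram, matmul(zt, y))


# Toy data: 4 genes, 2 latent variables, 3 samples (values are made up).
loadings = [[1, 0], [1, 0], [0, 1], [0, 1]]
expression = [[2, 4, 6], [4, 6, 8], [1, 0, 1], [3, 2, 1]]
lv_values = project(loadings, expression, ridge=0.0)
```

In this toy case each latent variable loads equally on two genes, so with `ridge=0.0` the projected value for a sample is simply the mean expression of those two genes. A positive `ridge` shrinks the projected values toward zero.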

We focus on systemic autoimmune conditions in this project; one group of conditions is rare and the other condition is not. First, we establish that PLIER is appropriate for use in a single-tissue, multi-dataset compendium (greenelab/rheum-plier-data/sle-wb) constructed from publicly available systemic lupus erythematosus (SLE) whole blood (WB) microarray data. We demonstrate that MultiPLIER, trained on the recount2 RNA-seq compendium, performs similarly to an SLE WB model in capturing certain cell type-specific signals and captures additional pathway signals. We also use MultiPLIER to analyze expression data from 3 tissues from anti-neutrophil cytoplasmic antibody (ANCA)-associated vasculitis (AAV), a family of rare diseases.

Overview figure of dataset-specific PLIER and MultiPLIER. Boxes with solid colored fills represent inputs to the model. White boxes with colored outlines represent model output. (A) PLIER (Mao et al., 2017) automatically extracts latent variables (LVs), shown as the matrix B, and their loadings (Z). We can train a PLIER model for each of three datasets from different tissues, which results in three dataset-specific latent spaces. (B) PLIER takes as input a prior information/knowledge matrix C and applies a constraint such that some of the loadings (Z), and therefore some of the LVs, capture biological signal in the form of curated pathways or cell type-specific gene sets. (C) Ideally, an LV will map to a single gene set or a group of highly related gene sets to allow for easy interpretation of the model. PLIER applies a penalty on U to facilitate this. Purple fill in a cell indicates a non-zero value and a darker purple indicates a higher value. We show an undesirable U matrix in the top toy example (Ci) and a favorable U matrix in the bottom toy example (Cii). (D) If models have been trained on individual datasets, we may be required to find "matching" LVs in different dataset- or tissue-specific models using the loadings (Z) from each model. Using a metric like the Pearson correlation between loadings, we may or may not be able to find a well-correlated match between datasets. (E) The MultiPLIER approach: train a PLIER model on a large collection of uniformly processed data from many different biological contexts and conditions (recount2; Collado-Torres et al., 2017), yielding a MultiPLIER model, and then project the individual datasets into the MultiPLIER latent space. The hatched fill indicates the sample dataset of origin. (F) Latent variables from the MultiPLIER model can be tested for differential expression between disease and controls in multiple tissues.
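Panel (D) of the caption mentions matching LVs between separately trained models using the Pearson correlation between loadings. The following is a minimal sketch of that idea, not the repository's implementation. It assumes both loading matrices cover the same genes in the same row order, and the names `pearson`, `columns`, and `best_matches` are hypothetical helpers introduced here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        # Correlation is undefined for a constant vector; report 0.0 here.
        return 0.0
    return cov / (sd_x * sd_y)


def columns(m):
    """Return the columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def best_matches(z_a, z_b):
    """For each LV (column) of z_a, find the most correlated LV of z_b.

    z_a, z_b: loadings (genes x latent variables) from two models,
    assumed to share the same genes in the same row order.
    Returns a list of (index_in_z_b, correlation), one entry per LV of z_a.
    """
    cols_b = columns(z_b)
    matches = []
    for col_a in columns(z_a):
        scores = [pearson(col_a, col_b) for col_b in cols_b]
        best = max(range(len(scores)), key=lambda j: scores[j])
        matches.append((best, scores[best]))
    return matches


# Toy loadings for two models over the same 4 genes (values are made up).
z_tissue_a = [[1.0, 0.0], [0.8, 0.1], [0.0, 0.9], [0.1, 1.0]]
z_tissue_b = [[0.0, 0.9], [0.2, 0.7], [1.0, 0.0], [0.8, 0.1]]
matches = best_matches(z_tissue_a, z_tissue_b)
```

A best match is only as good as its correlation. As the caption notes, a well-correlated partner may not exist, so in practice the returned correlation should be inspected before two LVs are treated as equivalent.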

For more information about the training set, please see this notebook.

Notebooks

Analysis notebooks are numbered and located in the top-level directory. We've enabled GitHub Pages for easy viewing of the notebooks. Some steps in the pipeline are R scripts rather than notebooks because they are computationally intensive; we exclude these from the TOC below.

Note that not all analyses present in this repository are included in the preprint.

The figure_notebooks directory contains notebooks that were used specifically to generate figures suitable for publication (figure_notebooks/figures).

License

This repository is dual licensed as BSD 3-Clause (source code) and CC0 1.0 (figures, documentation, and our arrangement of the facts contained in the underlying data), with the following exceptions:

recount2 data is licensed CC-BY.


Stars: ✭ 33
