Bioconductor / GenomicDataCommons

Provide R access to the NCI Genomic Data Commons portal.

What is the GDC?

From the Genomic Data Commons (GDC) website:

The National Cancer Institute's (NCI's) Genomic Data Commons (GDC) is a data sharing platform that promotes precision medicine in oncology. It is not just a database or a tool; it is an expandable knowledge network supporting the import and standardization of genomic and clinical data from cancer research programs.

The GDC contains NCI-generated data from some of the largest and most comprehensive cancer genomic datasets, including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Therapies (TARGET). For the first time, these datasets have been harmonized using a common set of bioinformatics pipelines, so that the data can be directly compared.

As a growing knowledge system for cancer, the GDC also enables researchers to submit data, and harmonizes these data for import into the GDC. As more researchers add clinical and genomic data to the GDC, it will become an even more powerful tool for making discoveries about the molecular basis of cancer that may lead to better care for patients.

The data model for the GDC is complex, but it is worth a quick overview. The data model is encoded as a so-called property graph. Nodes represent entities such as Projects, Cases, Diagnoses, Files (various kinds), and Annotations. The relationships between these entities are maintained as edges. Both nodes and edges may have Properties that supply instance details. The GDC API exposes these nodes and edges through a somewhat simplified set of RESTful endpoints.
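
To make the property-graph idea concrete, here is a toy sketch in base R. The node labels, edge labels, identifiers, and the `pointing_at()` helper are all invented for illustration; this is not the GDC's actual schema, only the general shape of nodes, edges, and properties.

```r
# Toy property graph (hypothetical data, NOT the real GDC schema).
# Each node has an id, a label (its entity type), and one property (`name`).
nodes = data.frame(
    id    = c("p1", "c1", "c2", "f1", "f2", "f3"),
    label = c("Project", "Case", "Case", "File", "File", "File"),
    name  = c("TCGA-OV", "case-A", "case-B", "a.bam", "a.counts.txt", "b.bam"),
    stringsAsFactors = FALSE
)

# Each edge links two nodes and carries its own property (`label`).
edges = data.frame(
    from  = c("c1", "c2", "f1", "f2", "f3"),
    to    = c("p1", "p1", "c1", "c1", "c2"),
    label = c("member_of", "member_of",
              "derived_from", "derived_from", "derived_from"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: ids of all nodes with an edge pointing at `target`.
pointing_at = function(target) edges$from[edges$to %in% target]

# Walk Project -> Cases -> Files.
cases_in_project = pointing_at("p1")
files_in_project = pointing_at(cases_in_project)
file_names = nodes$name[nodes$id %in% files_in_project]
file_names
```

A dotted field name such as cases.project.project_id, used later in this document, expresses roughly the same kind of traversal in the other direction: start from a file, follow the relationship to its case, then to that case's project.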

Quickstart

This software is available at Bioconductor.org and can be installed via BiocManager::install.

To report bugs or problems, either submit a new issue on GitHub or run bug.report(package='GenomicDataCommons') from within R (which redirects you to the new-issue page on GitHub).

Installation

Installation can be achieved via Bioconductor's BiocManager package.

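# Install BiocManager from CRAN if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# Install GenomicDataCommons from Bioconductor, then load it
BiocManager::install('GenomicDataCommons')
library(GenomicDataCommons)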

Check basic functionality

GenomicDataCommons::status()
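
In a script, the result of status() can be tested instead of read by eye. The following is a minimal sketch that requires network access. It assumes the returned list carries a `status` element equal to "OK" when the API is reachable; the variable name `gdc_status` is only illustrative.

```r
library(GenomicDataCommons)

# Query the GDC API status endpoint (requires network access).
gdc_status = GenomicDataCommons::status()

# Assumption: the returned list has a `status` element that reads "OK"
# when the service is healthy. Stop early if it does not.
stopifnot(gdc_status$status == "OK")
```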

Find data

The following code builds a manifest that can be used to guide the download of raw data. Here, filtering finds gene expression files quantified as raw counts using HTSeq from ovarian cancer patients.

library(magrittr)
ge_manifest = files() %>% 
    filter( cases.project.project_id == 'TCGA-OV') %>%
    filter( type == 'gene_expression' ) %>%
    filter( analysis.workflow_type == 'HTSeq - Counts') %>%
    manifest()
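
Before building a manifest, it can help to check which values the API currently reports for a field and how many files a filter matches. The sketch below uses available_values() and count(), both listed in the overview later in this document. It requires network access, and the variable names are illustrative.

```r
library(GenomicDataCommons)
library(magrittr)

# List the values the API reports for a field, to confirm that the
# strings passed to filter() actually occur in the data.
workflow_types = available_values('files', 'analysis.workflow_type')
print(workflow_types)

# Count the matching files without retrieving any records.
n_files = files() %>%
    filter(cases.project.project_id == 'TCGA-OV') %>%
    filter(type == 'gene_expression') %>%
    count()
n_files
```

If a filter value does not appear in the output of available_values(), the query will match no files and the resulting manifest will be empty.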

Download data

This code block downloads the gene expression files specified in the query above (nrow(ge_manifest) files in total). Using multiple processes for the download can speed up the transfer considerably in many cases. The following completes in about 15 seconds.

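library(BiocParallel)
# Register a multi-process backend; bplapply() uses it by default
register(MulticoreParam())
# Download each file by UUID in parallel; gdcdata() returns the local file paths
fnames = bplapply(ge_manifest$id, gdcdata)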

If the download had included controlled-access data, the call above would have needed to include a token. See gdc_token(), listed under Authentication in the overview below.
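
A minimal sketch of a token-aware download follows. It assumes that gdc_token() can locate a token previously downloaded from the GDC portal (see ?gdc_token for where it looks) and that gdcdata() accepts it through its `token` argument. The helper name `download_with_token` is hypothetical and not part of the package.

```r
library(GenomicDataCommons)

# Hypothetical helper: download the given file UUIDs, supplying the
# user's GDC token so that controlled-access files can be retrieved.
download_with_token = function(uuids) {
    # Look up the token at call time rather than storing it in the script.
    token = gdc_token()
    gdcdata(uuids, token = token)
}

# Usage, with `ge_manifest` from the "Find data" step:
# fnames = download_with_token(ge_manifest$id)
```

Looking the token up inside the function keeps it out of the global environment and out of saved workspaces.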

Metadata queries

expands = c("diagnoses","annotations",
             "demographic","exposures")
clinResults = cases() %>% 
    GenomicDataCommons::select(NULL) %>%
    GenomicDataCommons::expand(expands) %>% 
    results(size=50)
clinDF = as.data.frame(clinResults)
library(DT)
datatable(clinDF, extensions = 'Scroller', options = list(
  deferRender = TRUE,
  scrollY = 200,
  scrollX = TRUE,
  scroller = TRUE
))

Basic design

The design of this package is meant to resemble the "hadleyverse" approach of dplyr. Roughly, the functionality for finding and accessing files and metadata can be divided into:

  1. Simple query constructors based on GDC API endpoints.
  2. A set of verbs that, when applied, adjust filtering, field selection, and faceting (fields for aggregation) and result in a new query object (an endomorphism).
  3. A set of verbs that take a query and return results from the GDC.
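
The three parts can be sketched in a few lines. This example requires network access; the field names are assumed to be valid for the cases endpoint at the time of writing, and the variable names are illustrative.

```r
library(GenomicDataCommons)
library(magrittr)

# 1. Query constructor: builds a query object; no results are fetched yet.
q = cases()

# 2. Verbs: each call takes a query and returns a new query of the same kind.
q_ov = q %>%
    filter(project.project_id == 'TCGA-OV') %>%
    select(c('submitter_id', 'primary_site'))
identical(class(q), class(q_ov))

# 3. Execution: send the query to the GDC and retrieve results.
n_cases = count(q_ov)
res = results(q_ov, size = 5)
str(res, max.level = 1)
```

Because the verbs in step 2 return query objects, they can be chained in any order and reused before anything is sent to the GDC in step 3.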

In addition, there are auxiliary functions for asking the GDC API for information about available and default fields, slicing BAM files, and downloading actual data files. Here is an overview of functionality [1].

  • Creating a query
    • projects()
    • cases()
    • files()
    • annotations()
  • Manipulating a query
    • filter()
    • facet()
    • select()
  • Introspection on the GDC API fields
    • mapping()
    • available_fields()
    • default_fields()
    • grep_fields()
    • available_values()
    • available_expand()
  • Executing an API call to retrieve query results
    • results()
    • count()
    • response()
  • Raw data file downloads
    • gdcdata()
    • transfer()
    • gdc_client()
  • Summarizing and aggregating field values (faceting)
    • aggregations()
  • Authentication
    • gdc_token()
  • BAM file slicing
    • slicing()
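
A few of the functions listed above can be combined as follows. This is a sketch that requires network access; the search pattern 'workflow' and the faceted fields are arbitrary choices for illustration.

```r
library(GenomicDataCommons)
library(magrittr)

# Introspection: fields returned by default for the 'files' endpoint,
# and available fields whose names match a pattern.
file_defaults = default_fields('files')
head(file_defaults)
grep_fields('files', 'workflow')

# Faceting: ask the GDC to tabulate files by `type` and `data_type`.
# aggregations() returns one table per faceted field.
agg = files() %>%
    facet(c('type', 'data_type')) %>%
    aggregations()
names(agg)
head(agg$type)
```

Faceting is done on the GDC side, so only the summary tables are transferred rather than the individual file records.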

Footnotes

  1. See individual function and methods documentation for specific details.
