ropenscilabs / Umapr

Licence: other
UMAP dimensionality reduction in R

umapr

umapr wraps the Python implementation of UMAP to make the algorithm accessible from within R. It uses the great reticulate package.

Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimensionality reduction algorithm. It is similar to t-SNE but computationally more efficient. UMAP was created by Leland McInnes and John Healy (github, arxiv).

Recently, two new UMAP R packages have appeared. These new packages provide more features than umapr does and they are more actively developed. These packages are:

  • umap, which provides the same Python wrapping function as umapr and also an R implementation, removing the need for the Python version to be installed. It is available on CRAN.

  • uwot, which also provides an R implementation, removing the need for the Python version to be installed.
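
As a rough sketch of what switching to one of these packages might look like, the following assumes uwot is installed from CRAN and calls its umap() function on the numeric iris columns. The guard and the variable names X and layout are illustrative choices, not part of umapr.

```r
# Assumption: the uwot package is installed (install.packages("uwot")).
# If it is not, the example skips the embedding and leaves `layout` as NULL.
X <- as.matrix(iris[, 1:4])

layout <- NULL
if (requireNamespace("uwot", quietly = TRUE)) {
  # uwot::umap() returns a matrix with one row per observation
  # and n_components columns (2 by default).
  layout <- uwot::umap(X, n_neighbors = 15, min_dist = 0.1)
}
```

Unlike umapr, this route does not need a Python installation, which may make it the simpler choice for new projects.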

Contributors

Angela Li, Ju Kim, Malisa Smith, Sean Hughes, Ted Laderas

umapr is a project that was first developed at rOpenSci Unconf 2018.

Installation

First, you will need to install Python and the UMAP Python package. Instructions are available here.

Then, you can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("ropenscilabs/umapr")
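
Before loading umapr, it can help to confirm that R can actually see the Python side. The sketch below assumes reticulate is installed and that the Python UMAP module is importable as "umap" (it is distributed on PyPI as umap-learn). The variable name has_umap is an illustrative choice.

```r
# Assumption: reticulate is installed; umapr relies on it to reach Python.
has_umap <- FALSE
if (requireNamespace("reticulate", quietly = TRUE)) {
  # py_module_available() reports whether the module can be imported;
  # any unexpected error is treated as "not available".
  has_umap <- isTRUE(tryCatch(
    reticulate::py_module_available("umap"),
    error = function(e) FALSE
  ))
}

if (!has_umap) {
  message("Python module 'umap' not found; install umap-learn first.")
}
```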

Basic use

Here is an example of running UMAP on the iris data set.

library(umapr)
library(tidyverse)

# select only numeric columns
df <- as.matrix(iris[ , 1:4])

# run UMAP algorithm
embedding <- umap(df)

umap() returns a data.frame containing the original data plus two additional columns, "UMAP1" and "UMAP2". These columns hold the two-dimensional UMAP embedding and are column-bound to the original data.

# look at result
head(embedding)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width    UMAP1     UMAP2
#> 1          5.1         3.5          1.4         0.2 5.647059 -6.666872
#> 2          4.9         3.0          1.4         0.2 4.890193 -8.130815
#> 3          4.7         3.2          1.3         0.2 4.397037 -7.546669
#> 4          4.6         3.1          1.5         0.2 4.412886 -7.633424
#> 5          5.0         3.6          1.4         0.2 5.707233 -6.863213
#> 6          5.4         3.9          1.7         0.4 6.442851 -5.726554

# plot the result
embedding %>% 
  mutate(Species = iris$Species) %>%
  ggplot(aes(UMAP1, UMAP2, color = Species)) + geom_point()
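
To make the shape of the returned data.frame concrete without needing Python, the sketch below builds an object with the same layout. The coordinates are a stand-in (the first two principal components), not real UMAP output, and the name embedding_like is hypothetical.

```r
# Stand-in for the UMAP coordinates: two principal components, used here
# only so the example runs without Python. Real UMAP output would differ.
df <- as.matrix(iris[, 1:4])
coords <- prcomp(df)$x[, 1:2]
colnames(coords) <- c("UMAP1", "UMAP2")

# Column-bind the coordinates to the original data, mirroring the layout
# of the data.frame printed by head(embedding).
embedding_like <- cbind(as.data.frame(df), coords)
str(embedding_like)
```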

The run_umap_shiny() function launches a Shiny app for exploring the UMAP plot, letting you color the points by different variables.

run_umap_shiny(embedding)

Shiny App for Exploring Results

Function parameters

There are a few important parameters. These are fully described in the UMAP Python documentation.

The n_neighbors argument can range from 2 to n-1, where n is the number of rows in the data.

neighbors <- c(4, 8, 16, 32, 64, 128)



neighbors %>% 
  map_df(~umap(as.matrix(iris[,1:4]), n_neighbors = .x) %>% 
      mutate(Species = iris$Species, Neighbor = .x)) %>% 
  mutate(Neighbor = as.integer(Neighbor)) %>% 
  ggplot(aes(UMAP1, UMAP2, color = Species)) + 
    geom_point() + 
    facet_wrap(~ Neighbor, scales = "free")
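
When sweeping over n_neighbors on data sets of different sizes, it may be convenient to drop values outside the 2 to n-1 range first. The helper below is hypothetical (it is not part of umapr) and uses only base R.

```r
# Hypothetical helper: keep only the n_neighbors values that fall in the
# valid range 2 .. n - 1 for a given data set.
valid_neighbors <- function(candidates, data) {
  n <- nrow(data)
  candidates[candidates >= 2 & candidates <= n - 1]
}

# iris has 150 rows, so 256 is outside the valid range and is dropped.
neighbors <- c(4, 8, 16, 32, 64, 128, 256)
valid_neighbors(neighbors, iris)
```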

The min_dist argument can range from 0 to 1.

dists <- c(0.001, 0.01, 0.05, 0.1, 0.5, 0.99)

dists %>% 
  map_df(~umap(as.matrix(iris[,1:4]), min_dist = .x) %>% 
      mutate(Species = iris$Species, Distance = .x)) %>% 
  ggplot(aes(UMAP1, UMAP2, color = Species)) + 
    geom_point() + 
    facet_wrap(~ Distance, scales = "free")
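
The two sweeps above vary one parameter at a time. If you want to try both together, a settings grid is one way to organize the runs. This is a minimal base R sketch; the specific values are arbitrary picks from the vectors used above.

```r
# Build every combination of the two settings.
grid <- expand.grid(
  n_neighbors = c(4, 16, 64),
  min_dist    = c(0.001, 0.1, 0.99)
)

# Drop any combination outside the ranges described above.
n <- nrow(iris)
grid <- grid[grid$n_neighbors >= 2 & grid$n_neighbors <= n - 1 &
             grid$min_dist >= 0 & grid$min_dist <= 1, ]
nrow(grid)
```

Each row of grid could then be passed to umap() as its n_neighbors and min_dist arguments, in the same way the single-parameter sweeps do.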

The metric argument accepts many different distance functions.

metrics <- c("euclidean", "manhattan", "canberra", "cosine", "hamming", "dice")

metrics %>% 
  map_df(~umap(as.matrix(iris[,1:4]), metric = .x) %>% 
      mutate(Species = iris$Species, Metric = .x)) %>% 
  ggplot(aes(UMAP1, UMAP2, color = Species)) + 
    geom_point() + 
    facet_wrap(~ Metric, scales = "free")
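
To get a feel for how some of these metrics differ before running UMAP, you can compute them directly on a few rows. The sketch below uses base R's dist() for the metrics it supports and computes cosine distance by hand. It is independent of umapr and of the Python implementation, whose exact definitions may differ in detail.

```r
x <- as.matrix(iris[c(1, 51, 101), 1:4])  # one flower from each species

# dist() in base R supports several of the metrics named above.
d_euclidean <- dist(x, method = "euclidean")
d_manhattan <- dist(x, method = "manhattan")
d_canberra  <- dist(x, method = "canberra")

# Cosine distance is not built into dist(); compute 1 - cosine similarity
# after scaling each row to unit length.
unit <- x / sqrt(rowSums(x^2))
d_cosine <- as.dist(1 - unit %*% t(unit))

round(as.matrix(d_euclidean), 2)
```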

Comparison to t-SNE and principal components analysis

t-SNE and UMAP are both non-linear dimensionality reduction methods, in contrast to PCA. Because t-SNE is relatively slow, PCA is sometimes run first to reduce the dimensions of the data.
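
The PCA preprocessing step can be sketched in base R as follows. The choice of k = 2 components and the scaling settings are assumptions for illustration, not the settings used in the comparison below.

```r
X <- as.matrix(iris[, 1:4])

# Center and scale, then keep the first k principal components.
k <- 2
pca <- prcomp(X, center = TRUE, scale. = TRUE)
X_reduced <- pca$x[, seq_len(k)]

# Share of total variance retained by the k components.
var_kept <- sum(pca$sdev[seq_len(k)]^2) / sum(pca$sdev^2)

dim(X_reduced)
```

X_reduced could then be handed to a t-SNE implementation in place of the full matrix, for example Rtsne::Rtsne(X_reduced, pca = FALSE) if the Rtsne package is installed. Note that Rtsne checks for duplicate rows by default, so duplicates may need removing first.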

We compared UMAP to PCA and t-SNE alone, as well as to t-SNE run on data preprocessed with PCA. In each case, the data were subset to include only complete observations. The code to reproduce these findings is available in timings.R.
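
Subsetting to complete observations can be done in base R with complete.cases(). The sketch below uses the built-in airquality data set purely as an example with missing values; it is not one of the data sets in the comparison.

```r
# airquality ships with R and contains missing values.
complete <- airquality[complete.cases(airquality), ]

# na.omit(airquality) keeps the same rows.
c(before = nrow(airquality), after = nrow(complete))
```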

The first data set is the same iris data set used above (149 observations of 4 variables):

t-SNE, PCA, and UMAP on iris

Next we tried a cancer data set, made up of 699 observations of 10 variables:

t-SNE, PCA, and UMAP on cancer

Third, we tried a soybean data set, made up of 531 observations of 35 variables:

t-SNE, PCA, and UMAP on soybeans

Finally, we used a large single-cell RNA-sequencing data set, with 561 observations (cells) of 55,186 variables (over 30 million elements)!

t-SNE, PCA, and UMAP on rna

PCA is orders of magnitude faster than t-SNE or UMAP (not shown). UMAP, though, is a substantial improvement over t-SNE both in terms of memory and time taken to run.

Time to run t-SNE vs UMAP

Memory to run t-SNE vs UMAP
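
The actual benchmarks come from timings.R. As a minimal sketch of how a single run time can be measured in base R, the following times PCA on simulated data. The helper name time_once and the simulated matrix are assumptions for illustration, and memory use is not measured here.

```r
# Simulated data standing in for a real data set (illustration only).
set.seed(1)
X <- matrix(rnorm(500 * 50), nrow = 500)

# Hypothetical helper: elapsed wall-clock seconds for one call of f(X).
time_once <- function(f, X) {
  unname(system.time(f(X))["elapsed"])
}

pca_seconds <- time_once(prcomp, X)
pca_seconds
```

The same helper could wrap umap() or a t-SNE function. Repeating each measurement several times gives more stable estimates than a single run.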

Related projects

  • umap: R implementation of UMAP
  • seurat: R toolkit for single cell genomics
  • smallvis: R package for dimensionality reduction of small datasets