All Projects → nathan-russell → hashmap

nathan-russell / hashmap

Licence: other
Faster hash maps in R

Programming Languages

C++
36643 projects - #6 most used programming language
r
7636 projects
c
50402 projects - #5 most used programming language

Projects that are alternatives of or similar to hashmap

r-sparsepp
Rcpp interface to sparsepp - efficient hash map for C++
Stars: ✭ 12 (-83.33%)
Mutual labels:  rcpp, hashmap
rTRNG
R package providing access and examples to TRNG C++ library
Stars: ✭ 17 (-76.39%)
Mutual labels:  rcpp
rcppredis
R interface to Redis using the hiredis library
Stars: ✭ 45 (-37.5%)
Mutual labels:  rcpp
indicium
🔎 A simple in-memory search for collections and key-value stores.
Stars: ✭ 41 (-43.06%)
Mutual labels:  hashmap
bh
R package providing Boost Header files
Stars: ✭ 73 (+1.39%)
Mutual labels:  rcpp
ctl
My variant of the C Template Library
Stars: ✭ 105 (+45.83%)
Mutual labels:  hashmap
gmwm
Generalized Method of Wavelet Moments (GMWM) is an estimation technique for the parameters of time series models. It uses the wavelet variance in a moment matching approach that makes it particularly suitable for the estimation of certain state-space models.
Stars: ✭ 21 (-70.83%)
Mutual labels:  rcpp
rcppfastfloat
Rcpp Bindings for the 'fastfloat' Header-Only Library
Stars: ✭ 18 (-75%)
Mutual labels:  rcpp
rcpp progress
RcppProgress R package: An interruptible progress bar with OpenMP support for c++ in R packages
Stars: ✭ 26 (-63.89%)
Mutual labels:  rcpp
RcppML
Rcpp Machine Learning: Fast robust NMF, divisive clustering, and more
Stars: ✭ 52 (-27.78%)
Mutual labels:  rcpp
joineRML
R package for fitting joint models to time-to-event data and multivariate longitudinal data
Stars: ✭ 24 (-66.67%)
Mutual labels:  rcpp
rcppgsl
Rcpp integration for GNU GSL vectors and matrices
Stars: ✭ 28 (-61.11%)
Mutual labels:  rcpp
textTinyR
Text Processing for Small or Big Data Files in R
Stars: ✭ 32 (-55.56%)
Mutual labels:  rcpp
colourvalues
R library for assigning colours to values
Stars: ✭ 41 (-43.06%)
Mutual labels:  rcpp
ABCoptim
An implementation of the Artificial Bee Colony (ABC) Algorithm
Stars: ✭ 26 (-63.89%)
Mutual labels:  rcpp
rcppensmallen
Rcpp integration for the Ensmallen templated C++ mathematical optimization library
Stars: ✭ 28 (-61.11%)
Mutual labels:  rcpp
data-structure-project
自己实现集合框架系列整理总结
Stars: ✭ 29 (-59.72%)
Mutual labels:  hashmap
sfheaders
Build sf objects from R and Rcpp
Stars: ✭ 70 (-2.78%)
Mutual labels:  rcpp
Notes
This is a learning note | Java基础,JVM,源码,大数据,面经
Stars: ✭ 69 (-4.17%)
Mutual labels:  hashmap
needle
📌📚 An extensive standalone data structure library for JavaScript.
Stars: ✭ 25 (-65.28%)
Mutual labels:  hashmap

hashmap

Travis-CI Build Status MIT licensed CRAN_Status_Badge

Motivation

Unlike many programming languages, R does not implement a native hash table class. The typical workaround is to use environments, taking advantage of the fact that these objects are, by default, internally hashed:

EE <- new.env(hash = TRUE)  # equivalent to new.env()

set.seed(123)
list2env(
    setNames(
        as.list(rnorm(26)), 
        LETTERS
    ),
    envir = EE
)

EE[["A"]]
# [1] -0.5604756

EE[["D"]]
# [1] 0.07050839

EE[["Z"]]
# [1] -1.686693

In many situations, this is a fine solution - lookups are reasonably fast, and environments are highly flexible, allowing one to store virtually any type of R object (functions, lists, other environments, etc.). However, one of the major downsides to using envinronments as hash tables is the inability to work with vector arguments:

EE[[c("A", "B")]]
# Error in EE[[c("A", "B")]] : 
#   wrong arguments for subsetting an environment

EE[c("A", "B")]
# Error in EE[c("A", "B")] : 
#   object of type 'environment' is not subsettable

This is unfortunate, and somewhat surprising, considering most operations in R have vectorized semantics.


Solution

library(hashmap)

set.seed(123)
(HH <- hashmap(LETTERS, rnorm(26)))
## (character) => (numeric)  
##         [Z] => [-1.686693]
##         [Y] => [-0.625039]
##         [R] => [-1.966617]
##         [X] => [-0.728891]
##         [Q] => [+0.497850]
##         [P] => [+1.786913]
##       [...] => [...] 

HH[[c("A", "B")]]
# [1] -0.5604756 -0.2301775

It is important to note that unlike the environment-based solution, hashmap does NOT offer the flexibilty to store arbitrary types of objects. Any combination of the following atomic vector types is currently permitted:

  • keys
    • integer
    • numeric
    • character
    • Date
    • POSIXct
  • values
    • logical
    • integer
    • numeric
    • character
    • complex
    • Date
    • POSIXct

Features

What hashmap may lack in terms of flexibility it makes up for in two important areas: performance and ease-of-use. Let's begin with the latter by looking at some basic examples.

Usage

  • A Hashmap is created by passing a vector of keys and a vector of values to hashmap:

    set.seed(123)
    H <- hashmap(letters[1:10], rnorm(10))
    H
    ## (character) => (numeric)  
    ##         [j] => [-0.445662]
    ##         [i] => [-0.686853]
    ##         [h] => [-1.265061]
    ##         [g] => [+0.460916]
    ##         [e] => [+0.129288]
    ##         [d] => [+0.070508]
    ##       [...] => [...] 
  • If the lengths of the two vectors are not equal, the longer object is truncated to the length of its counterpart, and a warning is issued:

    hashmap(letters[1:5], 1:3)
    ## (character) => (integer)
    ##         [c] => [3]      
    ##         [b] => [2]      
    ##         [a] => [1]      
    # Warning message:
    # In new_CppObject_xp(fields$.module, fields$.pointer, ...) :
    #   length(keys) != length(values)!
    
    hashmap(letters[1:3], 1:5)
    ## (character) => (integer)
    ##         [c] => [3]      
    ##         [b] => [2]      
    ##         [a] => [1]      
    # Warning message:
    # In new_CppObject_xp(fields$.module, fields$.pointer, ...) :
    #   length(keys) != length(values)!
  • Value lookup can be performed by passing a vector of lookup keys to either of [[ or $find:

    H[["a"]]
    # [1] -0.5604756
    
    H$find("b")
    # [1] -0.2301775
    
    H[[c("a", "c")]]
    # [1] -0.5604756  1.5587083
    
    H$find(c("b", "d"))
    # [1] -0.23017749  0.07050839
  • For non-existant lookup keys, NA is returned:

    H[[c("a", "A", "b")]]
    # [1] -0.5604756         NA -0.2301775
  • Use $has_key to check for the existance of individual keys, or $has_keys for a vector of keys:

    H$has_key("a")
    # [1] TRUE
    
    H$has_key("A")
    # [1] FALSE
    
    H$has_keys(c("a", "A", "b", "B"))
    # [1]  TRUE FALSE  TRUE FALSE
  • Modification of key-value pairs is done using either of [[<- or $insert. For non-existing keys, a new key-value pair will be inserted. For existing keys, the previous value will be overwritten:

    H[[c("a", "x")]]
    # [1] -0.5604756         NA
    
    H[[c("a", "x")]] <- c(1.5, 26.5)
    H[[c("a", "x")]]
    # [1]  1.5 26.5
    
    H$insert(c("a", "y", "z"), c(100, 200, 300))
    H[[c("a", "y", "z")]]
    # [1] 100 200 300
  • To remove elements from the hash table, pass a vector of keys to $erase, which will delete entries for matched elements, and do nothing otherwise:

    H$has_keys(c("y", "Y", "z", "Z"))
    # [1]  TRUE FALSE  TRUE FALSE
    
    H$erase(c("y", "Y", "z", "Z"))
    
    H$has_keys(c("y", "Y", "z", "Z"))
    # [1] FALSE FALSE FALSE FALSE
  • Use $size to check the number of key-value pairs, $empty to check if the hash table is empty, and $clear to delete all existing entries:

    H$size()
    # [1] 11
    
    H$empty()
    # [1] FALSE
    
    H$clear()
    
    H$empty()
    # [1] TRUE
    
    H$size()
    # [1] 0
    
    H
    ## [empty Hashmap]
  • $keys and $values return every key and value, respectively, and $data returns a named vector of values, using the keys as names:

    H[[c("A", "B", "C")]] <- 1:3
    
    H$keys()
    # [1] "C" "B" "A"
    
    H$values()
    # [1] 3 2 1
    
    H$data()
    # C B A 
    # 3 2 1 
  • By default, only the first 6 key-value pairs of a Hashmap are printed, where [...] => [...] indicates that additional entries exist but are not displayed. This can be adjusted via options():

    getOption("hashmap.max.print")
    # [1] 6
    
    H
    ## (character) => (numeric)  
    ##         [C] => [+3.000000]
    ##         [B] => [+2.000000]
    ##         [A] => [+1.000000]
    
    H[[letters[1:10]]] <- rnorm(10)
    H
    ## (character) => (numeric)  
    ##         [j] => [-0.472791]
    ##         [i] => [+0.701356]
    ##         [h] => [-1.966617]
    ##         [g] => [+0.497850]
    ##         [e] => [-0.555841]
    ##         [d] => [+0.110683]
    ##       [...] => [...]
    
    options(hashmap.max.print = 15)
    H
    ## (character) => (numeric)  
    ##         [j] => [-0.472791]
    ##         [i] => [+0.701356]
    ##         [h] => [-1.966617]
    ##         [g] => [+0.497850]
    ##         [e] => [-0.555841]
    ##         [d] => [+0.110683]
    ##         [c] => [+0.400772]
    ##         [f] => [+1.786913]
    ##         [b] => [+0.359814]
    ##         [a] => [+1.224082]
    ##         [C] => [+3.000000]
    ##         [B] => [+2.000000]
    ##         [A] => [+1.000000]

Benchmark

The following is a simple test comparing the performance of an environment object against hashmap for

  1. Construction of the hash table
  2. Vectorized key lookup

An overview of results in presented here, but the full code to reproduce the test is in assets/benchmark.R. All of the examples use a one million element character vector for keys, and a one million element numeric vector for values.

Hash table construction was rather slow for the environment, despite my best moderate efforts to devise a fast solution, so expressions were only evaluated 25 times:

microbenchmark::microbenchmark(
    "Hash" = hashmap(Keys, Values),
    "Env" = env_hash(Keys, Values),
    times = 25L
)
# Unit: milliseconds
#  expr        min        lq      mean    median       uq       max neval cld
#  Hash   946.3524  1287.771  1784.404  1639.788  2243.93  3315.194    25   a 
#   Env 11724.2705 13218.521 14071.874 13685.929 15178.27 16516.216    25   b

Next, a lookup of all 1000 keys:

E <- env_hash(Keys, Values)
H <- hashmap(Keys, Values)

all.equal(env_find(Lookup, E), H[[Lookup]])
# [1] TRUE

microbenchmark::microbenchmark(
    "Hash" = H[[Lookup]],
    "Env" = env_find(Lookup, E), 
    times = 500L
)
# Unit: microseconds
#  expr       min       lq       mean     median         uq       max neval cld
#  Hash   314.182   738.98   804.5154   799.7065   858.3895  3013.285   500   a 
#   Env 12291.671 12651.12 13020.3816 12740.1735 12919.7355 67220.784   500   b

And finally, a comparison of key-lookups for vectors of various sizes, plotted below on the linear and logarithmic scales, where data points represent median evaluation time of 200 runs for the given expression:


The benchmark was conducted on a laptop running Ubuntu 14.04, with the following specs,

$ lscpu && printf "\n\n" && free -h
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 69
Stepping:              1
CPU MHz:               759.000
BogoMIPS:              4589.34
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K
NUMA node0 CPU(s):     0-3


             total       used       free     shared    buffers     cached
Mem:          7.7G       5.6G       2.1G       333M       499M       2.5G
-/+ buffers/cache:       2.6G       5.1G
Swap:           0B         0B         0B

in the following R session:

R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.6   hashmap_0.0.0.9000 ggvis_0.4.2       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4.1          rstudioapi_0.3.1       knitr_1.11             magrittr_1.5          
 [5] munsell_0.4.2          colorspace_1.2-6       xtable_1.8-2           R6_2.1.1              
 [9] plyr_1.8.3             dplyr_0.4.3            tools_3.2.4            parallel_3.2.4        
[13] grid_3.2.4             gtable_0.1.2           DBI_0.3.1              htmltools_0.3.5       
[17] yaml_2.1.13            lazyeval_0.1.10        assertthat_0.1         digest_0.6.8          
[21] shiny_0.13.2           ggplot2_2.0.0          microbenchmark_1.4-2.1 codetools_0.2-14      
[25] mime_0.4               rmarkdown_0.8.1        scales_0.3.0           jsonlite_0.9.17       
[29] httpuv_1.3.3           chron_2.3-47 

Installation

The stable release of hashmap can be installed from CRAN:

install.packages("hashmap")

The current development version can be installed from GitHub with devtools:

if (!"devtools" %in% installed.packages()[,1]) {
    install.packages("devtools")
}
devtools::install_github("nathan-russell/hashmap")
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].