privefl / bigstatsr

Licence: other

R package for statistical tools with big matrices stored on disk.

bigstatsr

R package {bigstatsr} provides functions for fast statistical analysis of large-scale data encoded as matrices. The package can handle matrices that are too large to fit in memory by memory-mapping them to binary files on disk. This format is very similar to the big.matrix format provided by R package {bigmemory}, which {bigstatsr} no longer uses (see the corresponding vignette). As inputs, package {bigstatsr} uses Filebacked Big Matrices (FBM).
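
To make the FBM idea concrete, here is a minimal sketch of creating one and reading it like a regular matrix. The dimensions and values are arbitrary choices for illustration; with no `backingfile` argument, the data is assumed to go to a temporary file.

```r
library(bigstatsr)

# Create a small 10 x 5 FBM of doubles, initialized to 0.
# No backingfile is given, so a temporary file holds the data.
X <- FBM(10, 5, init = 0)

# Standard matrix accessors read from and write to the file on disk
X[, 1] <- 1:10
X[1:3, 1:2]
dim(X)
typeof(X)

# Path of the binary file that stores the values
X$backingfile
file.exists(X$backingfile)
```

Only the parts you index are brought into memory, which is what allows the same code to work on matrices much larger than RAM.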

LIST OF FEATURES

Note that most of the algorithms of this package don't handle missing values.
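
Because of this, it can be worth checking for missing values before running an analysis. The following is one possible sketch using `big_apply()`; the toy matrix, the block size of 5 columns, and the name `na_per_col` are illustrative assumptions, not part of the package.

```r
library(bigstatsr)

# Toy FBM with a single missing value in column 3
X <- FBM(100, 20, init = 1)
X[5, 3] <- NA_real_

# Count missing values per column, reading 5 columns at a time
na_per_col <- big_apply(X, a.FUN = function(X, ind) {
  colSums(is.na(X[, ind, drop = FALSE]))
}, a.combine = 'c', block.size = 5)

# Indices of the columns that contain at least one missing value
which(na_per_col > 0)
```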

Installation

# For the CRAN version
install.packages("bigstatsr")
# For the latest version
remotes::install_github("privefl/bigstatsr")

Small example

library(bigstatsr)

# Create the data on disk
X <- FBM(5e3, 10e3, backingfile = "test")$save()
# If you open a new session you can do
X <- big_attach("test.rds")

# Fill it by chunks with random values
U <- matrix(0, nrow(X), 5); U[] <- rnorm(length(U))
V <- matrix(0, ncol(X), 5); V[] <- rnorm(length(V))
NCORES <- nb_cores()
# X = U V^T + E
big_apply(X, a.FUN = function(X, ind, U, V) {
  X[, ind] <- tcrossprod(U, V[ind, ]) + rnorm(nrow(X) * length(ind))
  NULL  ## you don't want to return anything here
}, a.combine = 'c', ncores = NCORES, U = U, V = V)
# Check some values
X[1:5, 1:5]

# Compute first 10 PCs
obj.svd <- big_randomSVD(X, fun.scaling = big_scale(), 
                         k = 10, ncores = NCORES)
plot(obj.svd)

# Cleanup
unlink(paste0("test", c(".bk", ".rds")))
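
As a sanity check on a matrix small enough to fit in memory, the partial SVD can be compared with base R's `svd()` on the scaled data. This is only a sketch: the sizes, the rank-3 signal, and the tolerance are arbitrary assumptions.

```r
library(bigstatsr)

set.seed(1)
# Small low-rank-plus-noise matrix, built the same way as X = U V^T + E above
U <- matrix(rnorm(200 * 3), 200, 3)
V <- matrix(rnorm(50 * 3), 50, 3)
mat <- tcrossprod(U, V) + matrix(rnorm(200 * 50), 200, 50)
X <- as_FBM(mat)

# Partial SVD of the centered and scaled FBM (first 3 components)
obj.svd <- big_randomSVD(X, fun.scaling = big_scale(), k = 3)

# Reference: full SVD of the scaled in-memory matrix
ref <- svd(scale(mat))
all.equal(obj.svd$d, ref$d[1:3], tolerance = 1e-3)

# PC scores (U D), one row per row of X
scores <- predict(obj.svd)
dim(scores)
```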

Learn more with this introduction to package {bigstatsr}.

If you want to use Rcpp code, look at this tutorial.

Some use cases

Parallelization

Package {bigstatsr} uses package {foreach} for its parallelization tasks. Learn more on parallelism with {foreach} with this tutorial.
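
For simple cases you may not need to write the parallel loop yourself. The sketch below uses the package's `big_parallelize()` helper to split the columns across workers; the toy matrix and the cap of 2 cores are assumptions made to keep the example small.

```r
library(bigstatsr)

X <- FBM(1000, 10, init = 1)

# Use at most 2 cores, and never more than the recommended number
ncores <- min(2, nb_cores())

# Each worker receives the FBM and the column indices it is responsible for
sums <- big_parallelize(X, p.FUN = function(X, ind) {
  colSums(X[, ind, drop = FALSE])
}, p.combine = 'c', ncores = ncores)

# Compare with the column sums computed by big_colstats()
all.equal(sums, big_colstats(X)$sum)
```

The FBM itself is not copied to the workers; each one re-attaches the same file on disk, so only the small result vectors are sent back.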

Large datasets

Computing the null space of a big matrix (works if one dimension is not too large)
Rowwise matrix multiplication
Operating with a big.matrix
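
As a related illustration (not taken from the linked discussions), products between an FBM and an in-memory matrix can be computed with `big_prodMat()` and `big_cprodMat()`. The sizes below are arbitrary, and the comparison with base R only works because this toy matrix fits in memory.

```r
library(bigstatsr)

set.seed(1)
mat <- matrix(rnorm(300 * 20), 300, 20)
X <- as_FBM(mat)

# X %*% A, where A has as many rows as X has columns
A <- matrix(rnorm(20 * 4), 20, 4)
XA <- big_prodMat(X, A)
all.equal(XA, mat %*% A)

# t(X) %*% B, where B has as many rows as X has rows
B <- matrix(rnorm(300 * 2), 300, 2)
XtB <- big_cprodMat(X, B)
all.equal(XtB, crossprod(mat, B))
```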

Bug report / Help

How to make a great R reproducible example?

Please open an issue if you find a bug.

If you want help using {bigstatsr}, please open an issue as well or post on Stack Overflow with the tag bigstatsr.

I will always redirect you to GitHub issues if you email me, so that others can benefit from our discussion.

References

Privé, Florian, et al. "Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr." Bioinformatics 34.16 (2018): 2781-2787.
Privé, Florian, Hugues Aschard, and Michael GB Blum. "Efficient implementation of penalized regression for genetic risk prediction." Genetics 212.1 (2019): 65-74.
