
bnosac / spark.sas7bdat

Licence: other
Read in SAS data in parallel into Apache Spark

Programming language: R

Projects that are alternatives of or similar to spark.sas7bdat

mleap
R Interface to MLeap
Stars: ✭ 24 (-4%)
Mutual labels:  sparklyr
taller SparkR
SparkR workshop for the Jornadas de Usuarios de R (Spanish R Users Conference)
Stars: ✭ 12 (-52%)
Mutual labels:  sparklyr
sparkbq
Sparklyr extension package to connect to Google BigQuery
Stars: ✭ 16 (-36%)
Mutual labels:  sparklyr
graphframes
R Interface for GraphFrames
Stars: ✭ 36 (+44%)
Mutual labels:  sparklyr
sparklygraphs
Old repo for R interface for GraphFrames
Stars: ✭ 13 (-48%)
Mutual labels:  sparklyr
sas7bdat-js
Read SAS files in JavaScript. Because you always wanted to do that, right?
Stars: ✭ 27 (+8%)
Mutual labels:  sas7bdat

spark.sas7bdat

The spark.sas7bdat package allows R users working with Apache Spark to read SAS datasets in .sas7bdat format into Spark, using the spark-sas7bdat Spark package. This allows R users to

  • load a SAS dataset in parallel into a Spark table for further processing with the sparklyr package
  • process the full SAS dataset in parallel with dplyr statements, instead of importing it entirely into RAM (with the foreign or haven packages), thereby avoiding memory problems with large imports

Example

The following example reads a file called iris.sas7bdat into a table called sas_example in Spark. Do try this with bigger data on your cluster, and see the help of the sparklyr package on how to connect to your Spark cluster.

library(sparklyr)
library(spark.sas7bdat)
mysasfile <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat")

sc <- spark_connect(master = "local")
x <- spark_read_sas(sc, path = mysasfile, table = "sas_example")
x

The resulting pointer to a Spark table can be used further in dplyr statements.

library(dplyr)
x %>% group_by(Species) %>%
  summarise(count = n(), length = mean(Sepal_Length), width = mean(Sepal_Width))
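
The summarised result still lives in Spark. If it is small enough, it can be pulled back into a local R data frame with dplyr's collect(), and the Spark connection can be closed afterwards with spark_disconnect(). A minimal sketch, continuing from the example above (the variable names x and sc are taken from that example):

```r
library(dplyr)

## pull the (small) aggregated result back into a local R data.frame
result <- x %>%
  group_by(Species) %>%
  summarise(count = n(),
            length = mean(Sepal_Length),
            width = mean(Sepal_Width)) %>%
  collect()

## result is now an ordinary in-memory tibble
str(result)

## close the connection to Spark when done
spark_disconnect(sc)
```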

Installation

Install the package from CRAN.

install.packages('spark.sas7bdat')

Or install the development version from GitHub.

devtools::install_github("bnosac/spark.sas7bdat", build_vignettes = TRUE)
vignette("spark_sas7bdat_examples", package = "spark.sas7bdat")

The package has been tested with Spark version 2.0.1 and Hadoop version 2.7.

library(sparklyr)
spark_install(version = "2.0.1", hadoop_version = "2.7")

Speed comparison

To compare this functionality with the read_sas function from the haven package, below is a comparison on a small SAS dataset of 5,234,557 rows x 2 columns containing only numeric data. Processing is done on 8 cores. With the haven package you need to import the data into RAM; with the spark.sas7bdat package you can immediately execute dplyr statements on top of the SAS dataset.

mysasfile <- "/home/bnosac/Desktop/testdata.sas7bdat"
system.time(x <- spark_read_sas(sc, path = mysasfile, table = "testdata"))
   user  system elapsed 
  0.008   0.000   0.051 
system.time(x <- haven::read_sas(mysasfile))
   user  system elapsed 
  1.172   0.032   1.200 
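
The very small elapsed time for spark_read_sas suggests that Spark evaluates lazily: registering the table returns almost immediately, and the actual parallel read happens only when a computation forces a pass over the data. A hedged sketch of how one could time a full scan, assuming the sc connection and the x table from the example above:

```r
library(dplyr)

## force Spark to actually scan the full dataset by counting all rows,
## then pull the single-row result back into R
system.time({
  n <- x %>% summarise(n = n()) %>% collect()
})
```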

Support in big data and Spark analysis

Need support in big data and Spark analysis? Contact BNOSAC: http://www.bnosac.be
