
edwindj / Chunked

Chunkwise Text-file Processing for 'dplyr'


Projects that are alternatives of or similar to Chunked

Querybuilder
SQL query builder, written in c#, helps you build complex queries easily, supports SqlServer, MySql, PostgreSql, Oracle, Sqlite and Firebird
Stars: ✭ 2,111 (+1279.74%)
Mutual labels:  database
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-1.96%)
Mutual labels:  database
Immudb
immudb - world’s fastest immutable database, built on a zero trust model
Stars: ✭ 3,743 (+2346.41%)
Mutual labels:  database
Etcd Cloud Operator
Deploying and managing production-grade etcd clusters on cloud providers: failure recovery, disaster recovery, backups and resizing.
Stars: ✭ 149 (-2.61%)
Mutual labels:  database
Deno Sqlite
Deno SQLite module
Stars: ✭ 151 (-1.31%)
Mutual labels:  database
Tera
An Internet-Scale Database.
Stars: ✭ 1,846 (+1106.54%)
Mutual labels:  database
Sqlitestudio
A free, open source, multi-platform SQLite database manager.
Stars: ✭ 2,337 (+1427.45%)
Mutual labels:  database
Hbase Connectors
Apache HBase Connectors
Stars: ✭ 153 (+0%)
Mutual labels:  database
Interview
A summary of interview resources for Android and Java developers, covering Java, Android, networking, operating systems, algorithms, and more
Stars: ✭ 150 (-1.96%)
Mutual labels:  database
Ebooks
A repository for ebooks, including C, C plus plus, Linux Kernel, Compiler, OS, Algorithm, Security, Database, Network, ML and DL
Stars: ✭ 151 (-1.31%)
Mutual labels:  database
Relaxo
Relaxo is a transactional document database built on top of git.
Stars: ✭ 149 (-2.61%)
Mutual labels:  database
Tidyheatmap
Draw heatmap simply using a tidy data frame
Stars: ✭ 151 (-1.31%)
Mutual labels:  dplyr
Slimdump
A tool for creating configurable dumps of large MySQL-databases.
Stars: ✭ 151 (-1.31%)
Mutual labels:  database
Bicing Api
Get statistics and locations of bicycle stations through REST API
Stars: ✭ 149 (-2.61%)
Mutual labels:  database
H2gis
A spatial extension of the H2 database.
Stars: ✭ 152 (-0.65%)
Mutual labels:  database
Doobie
Functional JDBC layer for Scala.
Stars: ✭ 1,910 (+1148.37%)
Mutual labels:  database
Grimoire
Database access layer for golang
Stars: ✭ 151 (-1.31%)
Mutual labels:  database
Myproxy
A sharding proxy for MYSQL databases
Stars: ✭ 153 (+0%)
Mutual labels:  database
Norm
Access a database in one line of code.
Stars: ✭ 152 (-0.65%)
Mutual labels:  database
Bats
A unified SQL engine for OLTP, OLAP, batch-processing, and stream-processing scenarios
Stars: ✭ 152 (-0.65%)
Mutual labels:  database

chunked

R is a great tool, but processing data in large text files is cumbersome. chunked helps you process large text files with dplyr while loading only part of the data in memory. It builds on the excellent R package LaF.

Processing commands are written in dplyr syntax, and chunked (using LaF) takes care of processing the file chunk by chunk, using far less memory than reading it in whole. chunked is useful for select-ing columns, mutate-ing columns and filter-ing rows. It is less helpful for group-ing and summarize-ing large text files. It can be used as a data pre-processing step.

Install

chunked can be installed from CRAN with:

install.packages('chunked')

the beta version with:

install.packages('chunked', repos=c('https://cran.rstudio.com', 'https://edwindj.github.io/drat'))

and the development version with:

devtools::install_github('edwindj/chunked')

Enjoy! Feedback is welcome...

Usage

Text file -> process -> text file

The most common use case is processing a large text file: selecting or adding columns, filtering rows, and writing the result back to a text file:

  library(chunked)
  library(dplyr)

  read_chunkwise("./large_file_in.csv", chunk_size = 5000) %>%
  select(col1, col2, col5) %>%
  filter(col1 > 10) %>%
  mutate(col6 = col1 + col2) %>%
  write_chunkwise("./large_file_out.csv")

chunked will process the above statement in chunks of 5000 records. This differs from, for example, read.csv, which reads all data into memory before processing it.

Text file -> process -> database

Another option is to use chunked as a preprocessing step before loading the data into a database:

db <- src_sqlite('test.db', create=TRUE)

tbl <- 
  read_chunkwise("./large_file_in.csv", chunk_size=5000) %>% 
  select(col1, col2, col5) %>%
  filter(col1 > 10) %>% 
  mutate(col6 = col1 + col2) %>% 
  write_chunkwise(db, 'my_large_table')
  
# tbl now points to the table in sqlite.

Db -> process -> Text file

chunked can also export a database table chunkwise to a text file. Note, however, that in that case the processing takes place in the database, and the chunkwise restrictions apply only to the writing.
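A minimal sketch of this direction, assuming 'test.db' contains a table 'my_large_table' as created in the previous example (the table and column names are illustrative); read_chunkwise also accepts a database tbl:

```r
library(chunked)
library(dplyr)

db <- src_sqlite('test.db')

# the filter is executed in the database; only the writing
# of the result to text happens chunk by chunk
tbl(db, 'my_large_table') %>%
  filter(col1 > 10) %>%
  read_chunkwise(chunk_size = 5000) %>%
  write_chunkwise('./large_file_out.csv')
```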

Lazy processing

chunked will not start processing until collect or write_chunkwise is called.

data_chunks <- 
  read_chunkwise("./large_file_in.csv", chunk_size=5000) %>% 
  select(col1, col3)
  
# won't start processing until
collect(data_chunks)
# or
write_chunkwise(data_chunks, "test.csv")
# or
write_chunkwise(data_chunks, db, "test")

Syntax completion of variables of a chunkwise file in RStudio works like a charm...

Dplyr verbs

chunked implements the following dplyr verbs:

  • filter
  • select
  • rename
  • mutate
  • mutate_each
  • transmute
  • do
  • tbl_vars
  • inner_join
  • left_join
  • semi_join
  • anti_join

Since data is processed in chunks, some dplyr verbs are not implemented:

  • arrange
  • right_join
  • full_join

summarize and group_by are implemented but generate a warning: they operate on each chunk, not on the whole data set. However, this makes it easier to process a large file: aggregate each chunk, then aggregate the combined results again.

  • summarize
  • group_by

  tmp <- tempfile()
  write.csv(iris, tmp, row.names=FALSE, quote=FALSE)
  iris_cw <- read_chunkwise(tmp, chunk_size = 30) # read in chunks of 30 rows for this example

  iris_cw %>%
    group_by(Species) %>%              # group within each chunk
    summarise( m = mean(Sepal.Width)   # and summarize within each chunk
             , w = n()
             ) %>%
    as.data.frame %>%                  # since each Species has 50 records, results span multiple chunks
    group_by(Species) %>%              # group the results from the chunks
    summarise(m = weighted.mean(m, w)) # and summarize them again
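The second aggregation step works because the weighted mean of per-chunk means, weighted by chunk sizes, equals the mean of the full data. A quick check in base R (with hypothetical values):

```r
# a vector split into two unequal "chunks"
x      <- c(5.1, 3.5, 4.7, 3.2, 4.6, 3.1, 5.0)
chunks <- split(x, rep(1:2, c(4, 3)))

m <- sapply(chunks, mean)    # per-chunk means
w <- sapply(chunks, length)  # per-chunk counts

# weighted mean of the chunk means equals the overall mean
stopifnot(isTRUE(all.equal(weighted.mean(m, w), mean(x))))
```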