datastorr

Simple data retrieval and versioning using GitHub

This project is described in a paper by Daniel Falster, Rich FitzJohn, Matt Pennell, and Will Cornwell. Below we describe the motivation and general idea. Please see the paper for full details.

The problem

Over the last several years, there has been increasing recognition that data is a first-class scientific product, and a tremendous number of repositories and platforms have been developed to facilitate the storage, sharing, and re-use of data. However, we think there is still an important gap in this ecosystem: platforms for data sharing offer limited functions for distributing and interacting with evolving datasets, that is, those that continue to grow with time as more records are added, errors are fixed, and new data structures are created. This is particularly the case for the small- to medium-sized datasets that a typical scientific lab, or collection of labs, might produce.

In addition to enabling data creators to maintain and share a living dataset, such an infrastructure would ideally enable data users to:

  • Cache downloads, including across R sessions, to make things faster and to work offline
  • Keep track of which versions are downloaded and available remotely
  • Access multiple versions of the data at once; this would be especially helpful if trying to understand why results have changed with the version of the data.

How datastorr helps

This package can be used in two ways:

  1. Work efficiently in R with data stored elsewhere (e.g., csv files that are too large to comfortably fit in git).
  2. Create another lightweight package designed to allow easy access to your data.

For both of these use cases, datastorr stores your data using GitHub releases, which do not clog up your repository but allow files of up to 2GB to be stored (future versions may support things like figshare).

datastorr is built around a simple versioning scheme for your data. If you do not expect the version to change, that should not matter. But if you work with data that changes (and everyone does eventually), this approach should make it easy to update files.

From the point of view of a user, using your data could be as simple as:

d <- datastorr::datastorr("richfitz/datastorr.example")

(see below for details, how this works, and what it is doing).

End user interface

See here for the aim from the point of view of an end user.

They would install your package (which contains no data so is nice and light and can be uploaded to CRAN).

devtools::install_github("richfitz/datastorr.example")

The user can see what versions they have locally:

datastorr.example::mydata_versions()

and can see what versions are present on GitHub:

datastorr.example::mydata_versions(local=FALSE) # remote
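
The two listings can be combined to find versions that exist on GitHub but have not yet been downloaded. The following is a minimal sketch, not part of datastorr: `versions_to_fetch` is a hypothetical helper, and it assumes that both calls return character vectors of plain version strings such as "0.0.1" (a prefix such as "v" would need stripping first).

```r
# Hypothetical helper: which remote versions are missing from the local cache?
# Assumes `local` and `remote` are character vectors of versions like "0.0.1".
versions_to_fetch <- function(local, remote) {
  missing <- setdiff(remote, local)
  if (length(missing) == 0L) {
    return(character(0))
  }
  # Sort as versions, not as strings, so that "0.0.10" comes after "0.0.9"
  missing[order(numeric_version(missing))]
}

# Intended use (requires the package and network access):
# local  <- datastorr.example::mydata_versions()
# remote <- datastorr.example::mydata_versions(local=FALSE)
# versions_to_fetch(local, remote)
```

Sorting with `numeric_version` rather than plain string ordering avoids the usual surprise where "0.0.10" would sort before "0.0.9".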

To download the most recent dataset:

d <- datastorr.example::mydata()

Subsequent calls (even across R sessions) are served from the cache, so the mydata() function is fast enough that you can use it in place of the data.

To get a particular version:

d <- datastorr.example::mydata("0.0.1")
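
Because any version can be requested by name, two versions can be loaded side by side to see what changed between them. Below is a small sketch under stated assumptions: `compare_versions` is a hypothetical helper, not a datastorr function, and it only assumes that the accessor (for example `datastorr.example::mydata`) takes a version string and returns an R object.

```r
# Hypothetical helper: load two versions through an accessor function and
# summarise how they differ. `get_data` is any function(version) -> object.
compare_versions <- function(get_data, old, new) {
  d_old <- get_data(old)
  d_new <- get_data(new)
  list(
    identical   = identical(d_old, d_new),
    # all.equal() returns TRUE, or a character vector describing differences
    differences = all.equal(d_old, d_new)
  )
}

# Intended use; "<newer version>" is a placeholder, not a known release:
# compare_versions(datastorr.example::mydata, "0.0.1", "<newer version>")
```

Passing the accessor in as an argument keeps the helper independent of any one data package.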

Downloads are cached across sessions using rappdirs.
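
To illustrate the idea of a per-version cache that survives across sessions, here is a minimal sketch. It is not datastorr's actual implementation: `cached_path`, the `fetch` callback, the ".rds" file naming and the "datastorr-sketch" application name are all assumptions made for illustration. Only `rappdirs::user_data_dir()` is taken from the rappdirs package.

```r
# A sketch of a version-keyed file cache; not datastorr's real code.
# `fetch` is any function(version, dest) that writes the data file to `dest`.
cached_path <- function(name, version, fetch,
                        root = rappdirs::user_data_dir("datastorr-sketch")) {
  dir <- file.path(root, name)
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(dir, paste0(version, ".rds"))
  if (!file.exists(path)) {
    # Cache miss: ask the callback to produce the file (e.g. by downloading)
    fetch(version, path)
  }
  path
}

# Example with a stand-in fetcher and a temporary cache directory:
fake_fetch <- function(version, dest) saveRDS(list(version = version), dest)
p <- cached_path("mydata", "0.0.1", fake_fetch, root = tempfile())
readRDS(p)$version
```

Because each version gets its own file, several versions can sit in the cache at once, and a second request for the same version never calls `fetch` again.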

Package developer process

The simplest way is to run the (hidden) function datastorr:::autogenerate, as

datastorr:::autogenerate(repo="richfitz/datastorr.example", read="readRDS", name="mydata")

which will print to the screen a bunch of code to add to your package. There will be a vignette explaining this more fully soon. A file generated in this way can be seen here.

Once set up, new releases can be made by running, within your package directory:

datastorr.example::mydata_release("description of release", "path/to/file")

provided you have your GITHUB_TOKEN environment variable set appropriately. See the vignette for more details.
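
Before making a release it can help to confirm that the token is actually visible to R. The check below is a sketch only: `require_github_token` is a hypothetical helper, not part of datastorr, and it tests only that the variable is non-empty, not that the token is valid or has the required permissions.

```r
# Hypothetical pre-release check: stop early if GITHUB_TOKEN is not set.
require_github_token <- function() {
  token <- Sys.getenv("GITHUB_TOKEN", unset = "")
  if (!nzchar(token)) {
    stop("GITHUB_TOKEN is not set; define it in ~/.Renviron or with Sys.setenv()")
  }
  # Return the token invisibly so it is not printed at the console
  invisible(token)
}
```

Calling `require_github_token()` just before the release function gives a clear error message instead of a failed upload.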

Installation

devtools::install_github("ropenscilabs/datastorr")

License

MIT + file LICENSE © Rich FitzJohn.
