leeper / UNF

Licence: other
Tools for Creating Universal Numeric Fingerprints for Data

Universal Numeric Fingerprint

UNF is a cryptographic hash or signature that can be used to uniquely identify (a version of) a rectangular dataset, or a subset thereof. UNF can be used, in tandem with a DOI or Handle, to form a persistent citation to a versioned dataset. A UNF signature is printed in the following form:

UNF:[UNF version][:UNF header options]:[UNF hash]

This allows a data consumer to quickly, easily, and definitively verify an in-hand data file against a data citation, or to test for the equality of two datasets, regardless of their variable order or file format. UNF is used by The Dataverse Network archiving software for data citation (making the UNF package a logical companion to the dvn package). This package implements UNF versions 3 and up (the current version is 6). Some details on the UNF algorithm and its R implementation are included in a package vignette ("The UNF Algorithm"), and details on the use of UNF in data citation are available in another vignette ("Data Citation with UNF").
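
As an illustration of the signature format above, the following base R sketch splits a printed signature into its parts. The helper name parse_unf() is hypothetical and not part of the UNF package. It assumes that fields are separated by colons and that the version may follow the UNF prefix either directly (as in the UNF6:... strings printed by this package) or after a colon.

```r
# Hypothetical helper (not part of the UNF package): split a printed
# signature into version, optional header options, and hash.
parse_unf <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  if (!grepl("^UNF:?[0-9]", x)) {
    stop("not a UNF signature: ", x)
  }
  # Drop the "UNF" prefix (and the colon after it, if any), then split fields
  fields <- strsplit(sub("^UNF:?", "", x), ":", fixed = TRUE)[[1]]
  if (!(length(fields) %in% c(2L, 3L))) {
    stop("unexpected number of fields in: ", x)
  }
  list(
    version = fields[1],
    options = if (length(fields) == 3L) fields[2] else NA_character_,
    hash    = fields[length(fields)]
  )
}

str(parse_unf("UNF6:6oVTvlCR+F1W1HTJ/QUmkA=="))
```

Splitting only on colons is safe here because the hash is base64 text, which never contains a colon.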

Please report any mismatches between this implementation and any other implementation (including Dataverse's) on the issues page!

Why UNFs?

While file checksums are a common strategy for verifying a file (e.g., MD5 sums are available for validating R packages), they are not well suited to serving as global signatures for a dataset. A UNF differs from an ordinary file checksum in several important ways:

  1. UNFs are format independent. The UNF for a dataset will be the same regardless of whether the data is saved as an R binary file, a SAS file, a Stata file, etc., but file checksums will differ. The UNF is also independent of variable arrangement and naming, which can be unintentionally changed during file reading.

    library("digest")
    library("UNF")
    write.csv(iris, file = "iris.csv", row.names = FALSE)
    iris2 <- read.csv("iris.csv")
    identical(iris, iris2)
    ## [1] FALSE
    
    identical(digest(iris, "md5"), digest(iris2, "md5"))
    ## [1] FALSE
    
    identical(unf(iris), unf(iris2))
    ## [1] TRUE
    
  2. UNFs are robust to insignificant rounding error. This is important when dealing with floating-point numeric values. A UNF will be the same if the data differ only in non-significant digits, whereas a file checksum will not.

    x1 <- 1:20
    x2 <- x1 + 1e-7
    identical(digest(x1), digest(x2))
    ## [1] FALSE
    
    identical(unf(x1), unf(x2))
    ## [1] TRUE
    
  3. UNFs detect misinterpretation of the data by statistical software. If the statistical software misreads the file, the resulting UNF will not match the original, but the file checksums may match. For example, numeric values read as character will produce a different UNF than those values read in as numerics.

    x1 <- 1:20
    x2 <- as.character(x1)
    identical(unf(x1), unf(x2))
    ## [1] FALSE
    
  4. UNFs are strongly tamper-resistant. Any accidental or intentional change to data values will change the resulting UNF. Most file checksums and descriptive statistics detect only certain types of changes.
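
The properties listed above all follow from normalising values before hashing them. The base R sketch below illustrates that idea with a toy fingerprint. It is not the UNF algorithm, and toy_fingerprint() is a hypothetical name: it simply writes numbers with 7 significant digits, tags non-numeric values differently, and takes an MD5 sum of the resulting text.

```r
# Toy fingerprint for illustration only: this is NOT the UNF algorithm.
toy_fingerprint <- function(x, digits = 7L) {
  normalised <- if (is.numeric(x)) {
    # Write numbers in exponential notation with `digits` significant
    # digits, so differences beyond that precision disappear.
    paste0("num:", formatC(as.numeric(x), digits = digits - 1L, format = "e"))
  } else {
    # Tag everything else with a different prefix, so "1" is not 1.
    paste0("chr:", as.character(x))
  }
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(normalised, tmp)
  unname(tools::md5sum(tmp))  # MD5 of the normalised text
}

x1 <- 1:20
identical(toy_fingerprint(x1), toy_fingerprint(x1 + 1e-7))           # rounding noise
identical(toy_fingerprint(x1), toy_fingerprint(as.character(x1)))    # type change
identical(toy_fingerprint(x1), toy_fingerprint(replace(x1, 5, 6L)))  # edited value
```

Only the first comparison should report a match: rounding noise is absorbed by the normalisation, while a change of type or of a value alters the text that gets hashed.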

Package Functionality

  • unf(): The core unf() function calculates the UNF signature for almost any R object for UNF algorithm versions 3, 4, 4.1, 5, or 6, with options to control the rounding of numeric values, truncation of character strings, and some idiosyncratic details of the UNFv5 algorithm as implemented by Dataverse. unf() is a wrapper for functions unf6(), unf5(), unf4(), and unf3(), which calculate vector-level UNF signatures.

    unf(iris)
    ## UNF6:6oVTvlCR+F1W1HTJ/QUmkA==
    
    str(unf(iris))
    ## List of 5
    ##  $ unf      : chr "6oVTvlCR+F1W1HTJ/QUmkA=="
    ##  $ hash     : raw [1:32] ea 85 53 be ...
    ##  $ unflong  : chr "6oVTvlCR+F1W1HTJ/QUmkHEAyPC4LZiHnI1s2rURxbs="
    ##  $ formatted: chr "UNF6:6oVTvlCR+F1W1HTJ/QUmkA=="
    ##  $ variables: Named chr [1:5] "FnQvOCZE9tcn64bP78wLag==" "epaV+rjvURem8qIo0r9LBQ==" "KP6tL8gFSqnG3FLJ887o/g==" "TN39UY6H/vRGv4ARWQTXrw==" ...
    ##   ..- attr(*, "names")= chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
    ##  - attr(*, "class")= chr "UNF"
    ##  - attr(*, "version")= num 6
    ##  - attr(*, "digits")= int 7
    ##  - attr(*, "characters")= int 128
    ##  - attr(*, "truncation")= int 128
    
  • %unf%: %unf% is a binary operator that compares two R objects, or an R object against a "UNF" class summary (e.g., as stored in a study metadata record, or returned by unf()). It tests whether the objects are identical and, if they are not, provides object- and variable-level UNF comparisons between the two objects, checks for differences in the sorting of the two objects, and (for data frames) reports indices for rows seemingly present in one object but missing from the other, based on row-level hashes of the variables common to both data frames. This can be used both to compare two objects in general (e.g., to see whether two data frames differ) and to debug incongruent UNFs. Two UNFs can differ dramatically due to minor changes like rounding, the deletion of an observation, or the addition of a variable, so %unf% provides a useful tool for looking under the hood at the differences between data objects that produce different UNF signatures.

    u <- unf(iris)
    unf(iris) %unf% u
    ## Objects are identical
    ## 
    ## UNF6:6oVTvlCR+F1W1HTJ/QUmkA== 
    ## 
    ## UNF6:6oVTvlCR+F1W1HTJ/QUmkA==
    
    unf(iris) %unf% unf(iris[,1:3])
    ## Objects are not identical
    ## 
    ## UNF6:6oVTvlCR+F1W1HTJ/QUmkA== 
    ## Mismatched variables:
    ## Petal.Width: TN39UY6H/vRGv4ARWQTXrw==
    ## Species: Xqh76nYY3z8eTfmL1KfxaQ==
    ## 
    ## UNF6:lEajCAiTPXcxJuP+hr8Kew==
    
    unf(iris) %unf% head(iris[,1:3])
    ## Objects are not identical
    ## 
    ## UNF6:6oVTvlCR+F1W1HTJ/QUmkA== 
    ## Mismatched variables:
    ## Sepal.Length: FnQvOCZE9tcn64bP78wLag==
    ## Sepal.Width: epaV+rjvURem8qIo0r9LBQ==
    ## Petal.Length: KP6tL8gFSqnG3FLJ887o/g==
    ## Petal.Width: TN39UY6H/vRGv4ARWQTXrw==
    ## Species: Xqh76nYY3z8eTfmL1KfxaQ==
    ## 
    ## UNF6:0Ppu3rquJJrYvjkDePjGbA== 
    ## Mismatched variables:
    ## Sepal.Length: yMtrQJDMuxcSay0afKLz5A==
    ## Sepal.Width: e6etgUxSU/7XccLSwNzHVQ==
    ## Petal.Length: oSk42LS4+joAOdTAr9OChQ==
    
  • as.unfvector(): as.unfvector() is an S3 generic that converts any R vector into the standardized character representation described by the UNF specification. While this functionality is primarily for internal use, it can be helpful for clarifying the difference (or lack thereof) between floating-point numbers, or between objects with identical meaning but different class representations that perhaps resulted from flawed data importing:

    # floating point ambiguity
    .14*10 == 1.4
    ## [1] FALSE
    
    as.unfvector(.14*10) == as.unfvector(1.4)
    ## [1] TRUE
    
    # substantively irrelevant class differences
    c(0L, 1L) == c(FALSE, TRUE)
    ## [1] TRUE TRUE
    
    as.unfvector(c(0L, 1L))
    ## [1] "+0.e+" "+1.e+"
    
    as.unfvector(c(FALSE, TRUE))
    ## [1] "+0.e+" "+1.e+"
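
The row-level comparison that %unf% reports for data frames can be pictured with a small base R sketch. This is not the package's implementation: rows_missing_from() is a hypothetical helper, and it matches rows on plain text keys built from the shared variables, whereas the package is described above as using row-level hashes.

```r
# Hypothetical helper (not the package's implementation): indices of rows
# in `a` that have no match in `b` on the variables the two share.
rows_missing_from <- function(a, b) {
  common <- intersect(names(a), names(b))
  stopifnot(length(common) > 0L)
  # One text key per row, built from the shared columns only
  row_key <- function(d) {
    do.call(paste, c(unname(lapply(d[common], as.character)), sep = "\r"))
  }
  which(!(row_key(a) %in% row_key(b)))
}

full    <- data.frame(id = 1:4, score = c(2.5, 3.1, 4.0, 1.8))
partial <- full[-3, ]
rows_missing_from(full, partial)
## [1] 3
```

Restricting the key to the common variables means that a data frame with extra or dropped columns can still be checked row by row against the original.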
    

Installation

UNF is on CRAN. To install the latest released version, use:

    install.packages("UNF")

To install the latest development version of UNF from GitHub:

    # latest (potentially unstable) version from GitHub
    # requireNamespace() checks for the package without attaching it
    if (!requireNamespace("remotes", quietly = TRUE)) {
        install.packages("remotes")
    }
    remotes::install_github("leeper/UNF")