capitalone / Datacomparer

Licence: apache-2.0

dataCompareR

dataCompareR is an R package that allows users to compare two datasets and view a report on the similarities and differences.

dataCompareR aims to make it easy to compare two tabular data objects in R. It’s specifically designed to present the differences between two datasets in a way that makes them easier to understand and, if necessary, helps you work out how to remedy them. In this regard, it aims to offer more useful output than all.equal when your two datasets do not match, but it isn’t intended to replace all.equal if you just want a binary test for equality.
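
As a rough illustration of the gap described above, the base R sketch below shows what all.equal reports when two data frames differ in a single value. The data frames a and b are made up for this example.

```r
# Two small, made-up data frames that differ in one value
a <- data.frame(id = 1:3, value = c(1.0, 2.0, 3.0))
b <- data.frame(id = 1:3, value = c(1.0, 2.5, 3.0))

# all.equal returns TRUE on a match; otherwise it returns a character
# vector describing the differences in aggregate
res <- all.equal(a, b)

is_match <- isTRUE(res)
print(is_match)
print(res)
```

The result summarises the difference per column rather than listing the affected rows, which is the kind of detail dataCompareR sets out to provide.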

  • rCompare() does the comparison and creates a dataCompareR object containing all the differences between the two input datasets. The object can be used with print and summary.
  • generateMismatchData() generates a list of two data frames, each having the missing rows from the comparison.
  • saveReport() creates a summary of the comparison that is saved into a file.
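
The three functions above might be combined as in the sketch below. It assumes dataCompareR is installed and uses the built-in iris data with a modified copy (iris2 and the report name are made up here). The argument names are taken from the package help pages; treat them as assumptions and check ?rCompare, ?generateMismatchData and ?saveReport before relying on them.

```r
library(dataCompareR)

# Modified copy of the built-in iris data (iris2 is an illustrative name)
iris2 <- iris
iris2[10, "Sepal.Length"] <- 5.2   # introduce one differing value

# Compare the two data frames row by row (no keys supplied)
comp <- rCompare(iris, iris2)

# Inspect the result
summary(comp)

# Pull out the rows that differ from each input
mismatches <- generateMismatchData(comp, iris, iris2)

# Write the report to a temporary folder without opening a viewer
saveReport(comp, reportName = "irisComparison",
           reportLocation = tempdir(), showInViewer = FALSE)
```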

It’s expected that dataCompareR will be used to compare data frames, but it can be used to compare any objects that can be coerced to data frames, such as data tables, tibbles or matrices. dataCompareR cannot compare data that is not tabular in format (nested JSON, irregular lists, etc.), but it does handle tabular data that needs to be matched (or joined) on one or more keys (or ID columns).
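
A keyed comparison could look like the sketch below. The two tables and the id column are invented for illustration, and the keys argument is assumed to take a character vector of column names, so several keys would be passed as c("key1", "key2").

```r
library(dataCompareR)

# Made-up tables holding the same records in a different row order
orders_a <- data.frame(id = c(1, 2, 3), amount = c(10, 20, 30))
orders_b <- data.frame(id = c(3, 1, 2), amount = c(30, 10, 25))

# Match rows on the id column before comparing the remaining columns
comp <- rCompare(orders_a, orders_b, keys = "id")
summary(comp)
```

Without keys, rows are compared in the order they appear, so supplying a key matters whenever the two inputs may be sorted differently.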

Getting started

Requirements

Confirmed as working on R v3.6.3 and v4.0.0 for Windows, as well as v3.6.2, v4.0.0 and the devel release for Linux. The package was built with the following dependencies, but we anticipate it will work with later versions of these packages.

Package    Version   Source code URL
dplyr      0.5.0     https://github.com/hadley/dplyr
knitr      1.12.3    https://github.com/yihui/knitr
stringi    1.0-1     https://github.com/gagolews/stringi
markdown   0.7.7     https://github.com/rstudio/markdown

Installing the package

You can install from CRAN via:

install.packages("dataCompareR")

You can also install the latest version directly from GitHub via:

library(devtools)
install_git('https://github.com/capitalone/dataCompareR.git', branch = 'master',
            subdir = 'dataCompareR', type = 'source', repos = NULL,
            build_vignettes = TRUE)

Using dataCompareR

Please run vignette('dataCompareR') after installation to see an example of the dataCompareR workflow.

Repo Contents

The code is arranged as an R package, with the following contents:

  • dataCompareR/R
  • dataCompareR/man
  • dataCompareR/tests/testthat
  • dataCompareR/tests/performancetesting
  • dataCompareR/inst/css
  • dataCompareR/vignette

The main folders are described below.

dataCompareR/R

The main body of R code that provides the dataCompareR functionality.

The R package format mandates a flat folder structure. Initial development used a nested structure, so to preserve it as far as possible, file names are prefixed with a two- or three-letter code that identifies the part of the code the file belongs to. The codes and hierarchy are as follows:

  • rc - rCompare - the entry point of the function
    • pf - processFlow - handles the flow of an rCompare run
      • vd - validateData - checks the data is suitable before starting an rCompare run
      • pd - prepareData - prepares the input data for comparison
      • cd - compareData - does the comparison
    • rco - rCompare object - routines to handle the rCompare object that is generated by an rCompare run
    • out - output - code to provide various views of the output

The filenames follow the format of the prefix, followed by an underscore, followed by a camelCase description of what the code does. The .R files tend to contain either one function or a small number of related functions.
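
To make the naming convention concrete, the base R sketch below groups a few file names by their prefix. The file names are hypothetical examples that follow the convention and are not a listing of the actual repository contents.

```r
# Hypothetical file names that follow the prefix_camelCase.R convention
files <- c("rc_rCompare.R", "pf_processFlow.R",
           "vd_validateData.R", "out_printSummary.R")

# The prefix is everything before the first underscore
prefix <- sub("_.*$", "", files)

# Group files by the part of the code they belong to
by_prefix <- split(files, prefix)
print(by_prefix)
```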

dataCompareR/man

Code is commented using roxygen2 headers, which are used to automatically create the required R man pages by running:

devtools::document()

dataCompareR/tests/testthat

Automated tests that are run via:

devtools::test()

These consist of both unit tests and some end-to-end tests that MUST pass before any code is merged to dev or main. We've added Travis integration, so this is now mandated. If your development code change breaks an existing test, then it is your responsibility to fix it!

The current unit test coverage can be found in testing.md. Please feel free to add more tests and regenerate this file using the covr package.

dataCompareR/tests/performancetesting

This folder contains useful, repeatable performance tests, but they are not run automatically, and the results they produce can only be interpreted manually.

CRAN Release Version History

https://cran.r-project.org/package=dataCompareR

  • Version 0.1.0 released on 2017-07-17
  • Version 0.1.1 released on 2017-11-14
  • Version 0.1.2 released on 2019-09-07
  • Version 0.1.3 released on 2020-05-01

External Contributors

We welcome and appreciate your contributions! Before we can accept any contributions, we ask that you please be sure to sign the Contributor License Agreement (CLA).

This project adheres to the Open Source Code of Conduct. By participating, you are expected to honor this code.