dataproofer / Dataproofer

Licence: gpl-3.0
A proofreader for your data

Dataproofer

A proofreader for your data. Currently in beta.

Every day, more and more data is created. Journalists, analysts, and data visualizers turn that data into stories and insights.

But before you can make use of any data, you need to know if it’s reliable. Is it weird? Is it clean? Can I use it to write or make a viz?

This used to be a long manual process that took valuable time and introduced the possibility of human error. People can’t always spot every mistake every time, no matter how hard they try.

Dataproofer is built to automate this process of checking a dataset for errors or potential mistakes.

Getting Started (Desktop)

Download a .zip of the latest release from the Dataproofer releases page.

Drag the app into your applications folder.

Select your dataset, which can be either a CSV on your computer, or a Google Sheet that you’ve published to the web.

Once you select your dataset, you can choose which suites and tests run by turning them on or off.

Proof your data, get your results, and feel confident about your dataset.

Getting Started (Command Line)

npm install -g dataproofer

Read the documentation

dataproofer --help
>  Usage: dataproofer <file>

  A proofreader for your data

  Options:

    -h, --help          output usage information
    -V, --version       output the version number
    -o, --out <file>    file to output results. default stdout
    -c, --core          run tests from the core suite
    -i, --info          run tests from the info suite
    -a, --stats         run tests from the statistical suite
    -g, --geo           run tests from the geographic suite
    -t, --tests <list>  comma-separated list to use
    -j, --json          output JSON of test results
    -J, --json-pretty   output an indented JSON of test results
    -S, --summary       output overall test results, excluding pass/fail results
    -v, --verbose       include descriptions about each column
    -x, --exclude       exclude tests that passed

  Examples:

    $ dataproofer my_data.csv

Run a test

node index.js data.csv

Save the results

node index.js --json data.csv --out data.json

To learn how to run specific test suites or tests and output longer or shorter summaries, use the --help flag.

Found a bug? Let us know.

Test Suites

Information & Diagnostics

A set of tests that infer descriptive information based on the contents of a table's cells.

  • Check for numeric values in columns
  • Check for strings in columns

Core Suite

A set of tests related to common problems and data checks — namely, making sure data has not been truncated by looking for specific cut-off indicators.

  • Check for duplicate rows
  • Check for empty columns (no values)
  • Check for special, non-typical Latin characters/letters in strings
  • Check for big integer cut-offs as defined by MySQL and PostgreSQL, common database programs
  • Check for integer cut-offs as defined by MySQL and PostgreSQL, common database programs
  • Check for small integer cut-offs as defined by MySQL and PostgreSQL, common database programs
  • Check for whether there are exactly 65k rows — an indication there may be missing rows lost when the data was exported from a database
  • Check for strings that are exactly 255 characters — an indication there may be missing data lost when the data was exported from MySQL
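The cut-off checks listed above can be sketched as follows. This is an illustration only, not the core suite's code: the function name findCutoffs, the INTEGER_CUTOFFS table, and the shape of the returned objects are all assumptions. The limits are the signed maximums shared by MySQL and PostgreSQL plus MySQL's unsigned maximums, and cells are compared as strings because the BIGINT limits are larger than JavaScript can represent exactly as numbers.

```javascript
// Illustrative sketch only: these names are hypothetical, not the core suite's API.
// Signed maximums (MySQL and PostgreSQL) followed by MySQL's unsigned maximums.
// Compared as strings because BIGINT limits exceed Number.MAX_SAFE_INTEGER.
const INTEGER_CUTOFFS = {
  smallint: ["32767", "65535"],
  integer: ["2147483647", "4294967295"],
  bigint: ["9223372036854775807", "18446744073709551615"],
};
const MAX_VARCHAR_LENGTH = 255;

function findCutoffs(rows, columnHeads) {
  const flagged = [];
  rows.forEach((row, rowIndex) => {
    columnHeads.forEach((column) => {
      const raw = String(row[column] ?? "");
      const cell = raw.trim();
      // A value sitting exactly on a type limit may have been clamped on export.
      for (const [type, limits] of Object.entries(INTEGER_CUTOFFS)) {
        if (limits.includes(cell)) {
          flagged.push({ rowIndex, column, reason: `${type} cut-off` });
        }
      }
      // A string of exactly 255 characters may have been truncated.
      if (raw.length === MAX_VARCHAR_LENGTH) {
        flagged.push({ rowIndex, column, reason: "255-character string" });
      }
    });
  });
  return flagged;
}
```

A flagged cell is a hint rather than proof: a column can legitimately contain 32767 or a 255-character string, so results like these are best treated as prompts for a manual look.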

Geo Suite

A set of tests related to common geographic data problems.

  • Check for invalid latitude and longitude values (values outside the range of -180° to 180°)
  • Check for void latitude and longitude values (values at 0°,0°)
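As a rough sketch of what checks like these involve, the function below classifies a single coordinate pair. It is not the geo suite's code: classifyCoordinate and toNumber are hypothetical names, and the bounds are the conventional ones (latitude from -90 to 90, longitude from -180 to 180), which is an assumption and is stricter for latitude than the range quoted in the list above.

```javascript
// Hypothetical helpers for illustration; not the geo suite's implementation.

// Converts a cell to a number, treating blank or missing cells as NaN
// (Number("") would otherwise silently become 0).
function toNumber(value) {
  if (value === null || value === undefined || String(value).trim() === "") {
    return NaN;
  }
  return Number(value);
}

// Returns "invalid", "void", or "ok" for one latitude/longitude pair.
// Assumed bounds: latitude in [-90, 90], longitude in [-180, 180].
function classifyCoordinate(lat, lng) {
  const latitude = toNumber(lat);
  const longitude = toNumber(lng);
  if (!Number.isFinite(latitude) || !Number.isFinite(longitude)) return "invalid";
  if (Math.abs(latitude) > 90 || Math.abs(longitude) > 180) return "invalid";
  // 0,0 is a real point in the Gulf of Guinea, but in most datasets it
  // signals a missing value that was filled in with zeros.
  if (latitude === 0 && longitude === 0) return "void";
  return "ok";
}
```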

Stats Suite

A set of tests related to common statistical methods used to detect outlying data.

  • Check for outliers within a column relative to the column's median
  • Check for outliers within a column relative to the column's mean
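One common way to express the two checks above is to measure distance from the mean in standard deviations and distance from the median in median absolute deviations. The sketch below takes that approach. The function names and the default threshold of 3 are assumptions made for illustration, and the stats suite itself may use different formulas or cut-offs.

```javascript
// Illustrative sketch; names and thresholds are assumptions, not the stats suite's API.

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Flags values more than `threshold` (population) standard deviations from the mean.
function outliersByMean(values, threshold = 3) {
  const m = mean(values);
  const sd = Math.sqrt(mean(values.map((v) => (v - m) ** 2)));
  if (sd === 0) return [];
  return values.filter((v) => Math.abs(v - m) / sd > threshold);
}

// Flags values more than `threshold` median absolute deviations from the median.
function outliersByMedian(values, threshold = 3) {
  const med = median(values);
  const mad = median(values.map((v) => Math.abs(v - med)));
  if (mad === 0) return [];
  return values.filter((v) => Math.abs(v - med) / mad > threshold);
}
```

For a small column such as [10, 11, 12, 13, 14, 100], the median-based check flags 100, while the mean-based check at the default threshold does not, because the extreme value inflates both the mean and the standard deviation. That difference is one reason to run both kinds of check.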

Development

This repo contains two pieces of code: the core library that runs tests and the Electron app that houses the UI. You can get them ready like so:

git clone https://github.com/dataproofer/Dataproofer.git 
cd Dataproofer
cd src
npm install
cd ../electron
npm install

You can run the development version of the app from the electron folder:

cd Dataproofer/electron
npm run electron

If you update the core library (index.js or src/*), you will need to run npm install again inside Dataproofer/electron for the change to be picked up, because the app relies on a "file:" dependency, which copies the source instead of downloading it.

How You Can Help

Write a test

See our test to-do list and leave a comment

Add a feature

See our features list and leave a comment

Short on time?

See our smaller issues and leave a comment

Got more time?

See our medium-sized issues and leave a comment

Plenty of time?

See our larger issues and leave a comment

Modifying a test suite

All tests belong to a suite, which is essentially just a node module that packages a group of tests together. In order to modify a test or add a new test to a suite, you will want to clone the project and link it. Let's say we want to modify the core-suite.

git clone https://github.com/dataproofer/core-suite.git
cd core-suite
npm install
npm link

cd ../Dataproofer
cd electron
npm link dataproofer-core-suite

Now when you change anything inside core-suite (like editing a test or making a new one) you can see your changes reflected when you run the app. Follow the instructions below for creating a new test in your suite!

Creating a new test

  • Make a copy of the basic test template
  • Read the comments and follow along with links
  • Let us know if you're running into trouble: dataproofer [at] dataproofer.org
  • require that test in a suite's index.js
  • Add that test to the exports in index.js

Tests are made up of a few parts. Here's a brief overview. For a more in-depth look, dive into the documentation.

.name()

This is the name of your test. It shows up in the test-selection screen as well as on the results page.

.description()

This is a text-only description of what the test does, and what it is meant to check. Imagine you are explaining it to a remarkably intelligent 5-year-old.

.methodology()

This is where the code your test executes lives. Pass it a function that takes in rows and columnHeads.

rows is an array of objects from the data. The object uses column headers as the key, and the row’s value as the value.

So if your data looks like this:

President         | Year
------------------------
George Washington | 1789
John Adams        | 1797
Thomas Jefferson  | 1801

Then the first object in your array of rows will look like this:

{ President: 'George Washington', Year: '1789' } and so on

Generally, to run a test, you are going to want to loop over each row and do some operations on it — counting cells and using conditionals to detect unwanted values.
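The loop described above might look like the following minimal sketch. The function countEmptyCells is a hypothetical name, and returning a plain object of counts is an assumption made for illustration; what a real test must return is covered by the project's documentation. The sample rows reuse the table above, with one Year left blank on purpose so there is something to detect.

```javascript
// Hypothetical methodology-style function: counts empty cells in each column.
// `rows` is an array of objects keyed by column header; `columnHeads` lists the headers.
function countEmptyCells(rows, columnHeads) {
  const emptyCounts = {};
  columnHeads.forEach((column) => {
    emptyCounts[column] = 0;
  });

  // Loop over every row, then over every column in that row.
  rows.forEach((row) => {
    columnHeads.forEach((column) => {
      const value = row[column];
      // The conditional that detects the unwanted value: a missing or blank cell.
      if (value === undefined || value === null || String(value).trim() === "") {
        emptyCounts[column] += 1;
      }
    });
  });

  return emptyCounts;
}

// Sample data based on the table above; the blank Year is added for illustration.
const rows = [
  { President: "George Washington", Year: "1789" },
  { President: "John Adams", Year: "" },
  { President: "Thomas Jefferson", Year: "1801" },
];
const columnHeads = ["President", "Year"];

console.log(countEmptyCells(rows, columnHeads));
```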

Helper Scripts

Helper scripts help you test and display the results of Dataproofer tests. These are a small set of functions we've found ourselves reusing.

  • isEmpty: detect if a cell is empty
  • isNumeric: detect if a cell contains a number
  • stripNumeric: remove number formatting like "$" or "%"
  • percent: return a number with a "%" sign

For more information, please see the full util documentation

Troubleshooting a test that won't run

Tests are run inside a try/catch block in src/processing.js, so an error thrown by a test is caught instead of surfacing directly. You may wish to temporarily remove the try/catch while iterating on a test. Otherwise, for now we recommend heavy doses of console.log and the Chrome debugger.

Iterating on tests

Dataproofer saves a copy of the most recently loaded file in the Application Data directory provided to it by the OS. You can quickly load the file and run the tests by typing loadLastFile() in the console, which saves you the clicks needed to load the file and press the run button while you are iterating on a test. To avoid clicks altogether for a while, you can add the function call to the ipc.on("last-file-selected", ...) event handler in electron/js/controller.js.

Packaging an executable

./build-executables.sh

This will create a new folder inside Dataproofer/executables that contains executables for Mac OS X, Windows, and Linux.

Release a new version

We can push releases to GitHub manually for now:

git tag -a 'v0.1.1' -m "first release"
git push && git push --tags

The binary (Dataproofer.app) can be uploaded to the releases page for the tag you pushed. Zip it up first (right-click and choose "Compress Dataproofer").

Sources

Thank You

A huge thank you to Vocativ and the Knight Foundation. This project was funded in part by the Knight Foundation's Prototype Fund.

Special Thanks

  • Alex Koppelman (interviewee), Editorial Director @ Vocativ
  • Allee Manning (interviewee), Data Reporter @ Vocativ
  • Allegra Denton (design consulting), Designer @ Vocativ
  • Brian Byrne (interviewee), Data Reporter @ Vocativ
  • Daniel Littlewood (video producer), Special Projects Producer @ Vocativ
  • EJ Fox (project lead), Dataviz Editor @ Vocativ
  • Gerald Rich (lead developer), Interactive Producer @ Vocativ
  • Ian Johnson (lead developer), Dataproofer
  • Jason Das (UX and design), Dataproofer
  • Joe Presser (video producer), Dataproofer
  • Julia Kastner (concept & name consulting), Project Manager @ Vocativ
  • Kelli Vanover (design consulting), Product Manager @ Vocativ
  • Markham Nolan (interviewee), Visuals Editor @ Vocativ
  • Rob Di Ieso (design consulting), Art Director @ Vocativ

... and the countless journalists who've encouraged us along the way. Thank you!
