TheRoniOne / Cleaner.jl

Licence: MIT license
A toolbox of simple solutions for common data cleaning problems.

Cleaner

Compatible with any Tables.jl implementation.

Installation: at the Julia REPL, run

julia> using Pkg; Pkg.add("Cleaner")

Key Features

With Cleaner.jl you will be able to:

  • Format column names to make them unique and fit snake_case or camelCase style.
  • Remove rows and columns filled with different kinds of empty values, e.g. missing, "", "NA", "None".
  • Delete columns filled with just a constant value.
  • Delete rows with at least one missing value.
  • Use a row as the names of the columns.
  • Minimize the number of element types for each column without making the column of type Any.
  • Add a row index to your table.
  • Automatically use multiple threads if your data is big enough (and you are running Julia with more than 1 thread).
  • Rematerialize your original source Tables.jl type, as CleanTable implements the Tables.jl interface too.
  • Apply Cleaner transformations on your original table implementation and have the resulting table be of the same type as the original.
  • Get all repeated values or value combinations that are supposed to be unique.
  • Get the percentage distribution of the different categories that make up your table.
  • Compare tables to help solve join or merge problems caused by having different schemas.
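
To make the first feature concrete, here is a minimal, self-contained sketch of what formatting names to snake_case and making them unique can look like. It is an illustration only, not Cleaner.jl's implementation: the helper names to_snake_case and make_unique are hypothetical, and the suffixing scheme for duplicates is an assumption.

```julia
# Illustrative sketch only; these helpers are hypothetical and are NOT part of Cleaner.jl.

# Turn an arbitrary column name into snake_case.
function to_snake_case(name::AbstractString)
    s = strip(name)                                    # drop surrounding whitespace
    s = replace(s, r"[^A-Za-z0-9]+" => "_")            # runs of other characters become one underscore
    s = replace(s, r"([a-z0-9])([A-Z])" => s"\1_\2")   # split camelCase boundaries
    return lowercase(strip(s, '_'))                    # trim stray underscores, then lowercase
end

# Make a list of names unique by appending _1, _2, ... to repeats (assumed scheme).
function make_unique(names)
    seen = Dict{String,Int}()
    out = String[]
    for n in names
        k = get(seen, n, 0)                            # how many times n has appeared so far
        push!(out, k == 0 ? n : string(n, "_", k))
        seen[n] = k + 1
    end
    return out
end

cleaned = make_unique(to_snake_case.([" some bad Name", "Another_weird name ", "someBadName"]))
```

This sketch does not handle every edge case (for example, a generated name such as a_1 colliding with an existing column), which is one reason to prefer the package's own functions on real data.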

To keep in mind

  • All non-mutating functions (those whose names do not end with !) receive a table as argument and return a CleanTable.
  • All mutating functions (those whose names end with !) receive a CleanTable and return a CleanTable.
  • All variants that return the original type (those whose names end with ROT) receive a table as argument and return a table of the same type as the original.

So you can start your workflow with a non-mutating function and continue it with mutating ones if you prefer. For example:

julia> df = DataFrame(" some bad Name" => [missing, missing, missing], "Another_weird name " => [1, 2, 3])
3×2 DataFrame
 Row │  some bad Name  Another_weird name
     │ Missing         Int64
─────┼─────────────────────────────────────
   1 │ missing                           1
   2 │ missing                           2
   3 │ missing                           3

julia> df |> polish_names |> compact_columns!
┌────────────────────┐
│ another_weird_name │
│              Int64 │
├────────────────────┤
│                  1 │
│                  2 │
│                  3 │
└────────────────────┘
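
The same workflow can be written step by step to show which kind of table each call takes and returns. This is a sketch under assumptions: polish_names and compact_columns! appear in the example above, but the name polish_names_ROT is inferred from the ROT suffix convention, and rebuilding a DataFrame with DataFrame(ct) relies on CleanTable implementing the Tables.jl interface. Check both against the package documentation before depending on them.

```julia
using Cleaner, DataFrames

df = DataFrame(" some bad Name" => [missing, missing, missing],
               "Another_weird name " => [1, 2, 3])

# Non-mutating: accepts any Tables.jl table and returns a new CleanTable.
ct = polish_names(df)

# Mutating: accepts the CleanTable from the previous step and modifies it in place.
compact_columns!(ct)

# CleanTable implements the Tables.jl interface, so it can be passed
# to a Tables.jl sink such as the DataFrame constructor.
clean_df = DataFrame(ct)

# ROT variant (function name assumed from the suffix convention):
# expected to return a table of the same type as its input, here a DataFrame.
renamed_df = polish_names_ROT(df)
```

Starting with the non-mutating call leaves the original df untouched, while the later in-place steps avoid building a new table at every stage.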

Related Packages

Acknowledgement

Inspired by janitor from the R ecosystem.
