ChrisMuir / Refinr

Cluster and merge similar character values: an R implementation of OpenRefine clustering algorithms

refinr

refinr is designed to cluster and merge similar values within a character vector. It features two functions that implement clustering algorithms from the open source software OpenRefine. The clustering methods used are key collision and n-gram fingerprint (see the OpenRefine clustering documentation for details).

In addition, a few add-on features make the clustering/merging functions more useful: approximate string matching to allow for merging despite minor misspellings, the option to pass a dictionary vector to dictate edit values, and the option to pass a vector of strings to ignore during the clustering process.

Please report issues, comments, or feature requests.

Installation

Install from CRAN:

install.packages("refinr")

Or install the dev version from this repo:

# install.packages("devtools")
devtools::install_github("ChrisMuir/refinr")

Example Usage

library(refinr)

x <- c("Acme Pizza, Inc.", "ACME PIZZA COMPANY", "acme pizza LLC", "Acme Pizza, Inc.")
key_collision_merge(x)
#> [1] "Acme Pizza, Inc." "Acme Pizza, Inc." "Acme Pizza, Inc." "Acme Pizza, Inc."
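
To see why these four values collapse into a single cluster, here is a minimal base R sketch of key collision clustering. It is not refinr's actual implementation: the helper names `fingerprint` and `merge_by_key`, and the `drop_tokens` suffix list, are assumptions made for illustration.

```r
# Illustrative sketch only. The helper names and the suffix list below are
# assumptions, not part of the refinr API.
fingerprint <- function(x, drop_tokens = c("inc", "llc", "co", "company")) {
  x <- tolower(trimws(x))                 # normalize case and outer whitespace
  x <- gsub("[[:punct:]]", "", x)         # strip punctuation
  tokens <- strsplit(x, "[[:space:]]+")   # split each value into words
  vapply(tokens, function(tok) {
    tok <- setdiff(tok, c(drop_tokens, ""))  # drop suffixes; setdiff also dedupes
    paste(sort(tok), collapse = " ")         # sorted tokens form the key
  }, character(1))
}

# Replace every value with the most frequent original value sharing its key.
merge_by_key <- function(x, key) {
  out <- x
  for (k in unique(key)) {
    idx <- which(key == k)
    counts <- table(x[idx])
    out[idx] <- names(counts)[which.max(counts)]
  }
  out
}

x <- c("Acme Pizza, Inc.", "ACME PIZZA COMPANY", "acme pizza LLC", "Acme Pizza, Inc.")
keys <- fingerprint(x)
unique(keys)
#> [1] "acme pizza"
merged <- merge_by_key(x, keys)
```

Because all four inputs share the key "acme pizza", they form one cluster, and the most frequent original spelling is used as the merged value.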

A dictionary character vector can be passed to key_collision_merge, which will dictate merge values when a cluster has a match within the dict vector.

x <- c("Acme Pizza, Inc.", "ACME PIZZA COMPANY", "acme pizza LLC", "Acme Pizza, Inc.")
key_collision_merge(x, dict = c("Nicks Pizza", "acme PIZZA inc"))
#> [1] "acme PIZZA inc" "acme PIZZA inc" "acme PIZZA inc" "acme PIZZA inc"
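
The dictionary behavior can be pictured as a key lookup: if a value's key matches the key of a dictionary entry, the dictionary spelling wins. The following base R sketch works under that assumption. `fingerprint`, `apply_dict`, and the `drop_tokens` list are hypothetical helpers, not refinr functions, and refinr's real matching logic may differ.

```r
# Illustrative sketch only; helper names and the suffix list are assumptions.
fingerprint <- function(x, drop_tokens = c("inc", "llc", "co", "company")) {
  x <- gsub("[[:punct:]]", "", tolower(trimws(x)))
  vapply(strsplit(x, "[[:space:]]+"), function(tok) {
    paste(sort(setdiff(tok, c(drop_tokens, ""))), collapse = " ")
  }, character(1))
}

# Use the dictionary spelling wherever a value's key matches a dictionary key;
# values without a dictionary match are returned unchanged.
apply_dict <- function(x, dict) {
  hit <- match(fingerprint(x), fingerprint(dict))
  ifelse(is.na(hit), x, dict[hit])
}

x <- c("Acme Pizza, Inc.", "ACME PIZZA COMPANY", "acme pizza LLC", "Acme Pizza, Inc.")
result <- apply_dict(x, dict = c("Nicks Pizza", "acme PIZZA inc"))
result
#> [1] "acme PIZZA inc" "acme PIZZA inc" "acme PIZZA inc" "acme PIZZA inc"
```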

Function n_gram_merge can be used to merge similar values that contain slight spelling differences. The stringdist package is used for calculating edit distance between strings. refinr links to the stringdist C API to improve the speed of the functions.

x <- c("Acmme Pizza, Inc.", "ACME PIZA COMPANY", "Acme Pizzazza LLC")
n_gram_merge(x, weight = c(d = 0.2, i = 0.2, s = 1, t = 1))
#> [1] "ACME PIZA COMPANY" "ACME PIZA COMPANY" "ACME PIZA COMPANY"

# The performance of the approximate string matching can be adjusted using parameters
# "weight" and/or "edit_threshold".
n_gram_merge(x, weight = c(d = 1, i = 1, s = 0.1, t = 0.1))
#> [1] "Acme Pizzazza LLC" "ACME PIZA COMPANY" "Acme Pizzazza LLC"
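
As a rough picture of the idea behind n-gram fingerprinting with approximate matching, the base R sketch below builds a fingerprint from the sorted unique n-grams of each value and then compares fingerprints by edit distance. It is a simplification under stated assumptions: `ngram_fingerprint` is a hypothetical helper, base R's `utils::adist` stands in for the stringdist package, and the `weight` and `edit_threshold` parameters are not modeled.

```r
# Illustrative sketch only; ngram_fingerprint is a hypothetical helper and
# utils::adist is used here in place of stringdist.
ngram_fingerprint <- function(x, n = 2) {
  x <- gsub("[^a-z0-9]", "", tolower(x))  # keep only lowercase letters and digits
  vapply(x, function(s) {
    if (nchar(s) < n) return(s)           # too short to form an n-gram
    starts <- seq_len(nchar(s) - n + 1)
    grams <- substring(s, starts, starts + n - 1)
    paste(sort(unique(grams)), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

x <- c("Acme Pizza", "acme-pizza", "Acmme Piza")
fp <- ngram_fingerprint(x)

# Values that differ only in case, spacing, or punctuation get identical keys.
fp[1] == fp[2]
#> [1] TRUE

# A misspelled value gets a different key that is still only a few edits away,
# so a distance threshold can decide whether it joins the same cluster.
d <- utils::adist(fp[1], fp[3])[1, 1]
```

A lower threshold on `d` makes merging stricter, while a higher one tolerates more spelling variation at the cost of more false-positive merges.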

Both key_collision_merge and n_gram_merge accept an optional argument, ignore_strings, which takes a character vector of strings to be ignored during the merging of values.

x <- c("Bakersfield Highschool", "BAKERSFIELD high", "high school, bakersfield")
key_collision_merge(x, ignore_strings = c("high", "school", "highschool"))
#> [1] "BAKERSFIELD high" "BAKERSFIELD high" "BAKERSFIELD high"

The clustering is designed to be insensitive to common business name suffixes, e.g. "inc", "llc", and "co". This feature can be turned on or off using the function parameter bus_suffix.

Workflow for checking the results of the refinr processes

library(dplyr)
library(knitr)

x <- c(
  "Clemsson University", 
  "university-of-clemson", 
  "CLEMSON", 
  "Clem son, U.", 
  "college, clemson u", 
  "M.I.T.", 
  "Technology, Massachusetts' Institute of", 
  "Massachusetts Inst of Technology", 
  "UNIVERSITY:  mit"
)

ignores <- c("university", "college", "u", "of", "institute", "inst")

x_refin <- x %>% 
  refinr::key_collision_merge(ignore_strings = ignores) %>% 
  refinr::n_gram_merge(ignore_strings = ignores)

# Create df for comparing the original values to the edited values.
# This is especially useful for larger input vectors.
inspect_results <- tibble(original_values = x, edited_values = x_refin) %>% 
  mutate(equal = original_values == edited_values)

# Display only the values that were edited by refinr.
knitr::kable(
  inspect_results[!inspect_results$equal, c("original_values", "edited_values")]
)
#> |original_values                         |edited_values                    |
#> |:---------------------------------------|:--------------------------------|
#> |Clemsson University                     |CLEMSON                          |
#> |university-of-clemson                   |CLEMSON                          |
#> |Clem son, U.                            |CLEMSON                          |
#> |college, clemson u                      |CLEMSON                          |
#> |Technology, Massachusetts' Institute of |Massachusetts Inst of Technology |
#> |UNIVERSITY:  mit                        |M.I.T.                           |

Notes

This package is NOT meant to replace OpenRefine for every use case. For situations in which merging accuracy is the most important consideration, OpenRefine is preferable. Because the merging steps in refinr are automated, they will usually produce more false-positive merges than manually selecting clusters to merge in OpenRefine.
The advantages this package has over OpenRefine:
- Operations are fully automated.
- Facilitates a more reproducible workflow.
- Faster when working with large input data (character vectors of length 500000+).
