shivam5992 / Dupandas

📊 Python package for performing deduplication using flexible text matching and cleaning in a pandas dataframe

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Dupandas

Modin
Modin: Speed up your Pandas workflows by changing a single line of code
Stars: ✭ 6,639 (+33095%)
Mutual labels:  pandas
Finta
Common financial technical indicators implemented in Pandas.
Stars: ✭ 901 (+4405%)
Mutual labels:  pandas
Pyda 2e Zh
📖 [Translation] Python for Data Analysis, 2nd Edition
Stars: ✭ 866 (+4230%)
Mutual labels:  pandas
Jdupes
A powerful duplicate file finder and an enhanced fork of 'fdupes'.
Stars: ✭ 790 (+3850%)
Mutual labels:  deduplication
Quickviz
Visualize a pandas dataframe in a few clicks
Stars: ✭ 18 (-10%)
Mutual labels:  pandas
Python Introducing Pandas
Introduction to pandas Treehouse course
Stars: ✭ 24 (+20%)
Mutual labels:  pandas
Machine Learning
A repository made to help machine learning beginners and those preparing for a study group.
Stars: ✭ 705 (+3425%)
Mutual labels:  pandas
Yelp dataset challenge
Play around with Yelp dataset in Python (in progress and very messy repo)
Stars: ✭ 15 (-25%)
Mutual labels:  pandas
Borgmatic
Simple, configuration-driven backup software for servers and workstations
Stars: ✭ 902 (+4410%)
Mutual labels:  deduplication
Disatbot
DABOT: Disaster Attention Bot
Stars: ✭ 26 (+30%)
Mutual labels:  pandas
Dataframe
C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types, continuous memory storage, and no pointers are involved
Stars: ✭ 828 (+4040%)
Mutual labels:  pandas
Foxcross
AsyncIO serving for data science models
Stars: ✭ 18 (-10%)
Mutual labels:  pandas
S3bp
Read and write Python objects to S3, caching them on your hard drive to avoid unnecessary IO.
Stars: ✭ 24 (+20%)
Mutual labels:  pandas
Pandas exercises
Practice your pandas skills!
Stars: ✭ 7,140 (+35600%)
Mutual labels:  pandas
Pandas Profiling
Create HTML profiling reports from pandas DataFrame objects
Stars: ✭ 8,329 (+41545%)
Mutual labels:  pandas
Fecon235
Notebooks for financial economics. Keywords: Jupyter notebook pandas Federal Reserve FRED Ferbus GDP CPI PCE inflation unemployment wage income debt Case-Shiller housing asset portfolio equities SPX bonds TIPS rates currency FX euro EUR USD JPY yen XAU gold Brent WTI oil Holt-Winters time-series forecasting statistics econometrics
Stars: ✭ 708 (+3440%)
Mutual labels:  pandas
Boltzmannclean
Fill missing values in Pandas DataFrames using Restricted Boltzmann Machines
Stars: ✭ 23 (+15%)
Mutual labels:  pandas
Kodiak
Enhance your feature engineering workflow with Kodiak
Stars: ✭ 20 (+0%)
Mutual labels:  pandas
Numsharp
High Performance Computation for N-D Tensors in .NET, similar API to NumPy.
Stars: ✭ 882 (+4310%)
Mutual labels:  pandas
Phildb
Timeseries database
Stars: ✭ 25 (+25%)
Mutual labels:  pandas

dupandas: data deduplication of text records in a pandas dataframe

Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

dupandas is a Python package for deduplicating columns of a pandas dataframe using flexible text matching. It is compatible with both Python 2.x and 3.x. dupandas can find duplicates among any kind of text records in pandas data. It comprises sophisticated Matchers that can handle spelling differences and phonetics, and several Cleaners that remove noise from the text data, such as punctuation, digits and casing.

For fast computation, dupandas uses Lucene-based text indexing. If "indexing" is set to True in the input_config, dupandas indexes the dataset in a RAMDirectory, which is then used to identify and search for similar strings. See the PyLucene installation instructions below.

A nice feature of dupandas is that its Matchers, Cleaners and Indexing functions can be used as standalone packages when working with text data.

Installation

The following Python modules are required to use dupandas: pandas, fuzzy, python-levenshtein. They can be installed with pip:

    pip install dupandas pandas fuzzy python-levenshtein

Or, if the dependencies are already installed:

    pip install dupandas

OPTIONAL: For faster execution, using dupandas with the indexing feature is recommended. dupandas uses PyLucene for data indexing.
PyLucene installation: Lucene indexing requires Java to be installed; Java 8 is recommended. Refer to this link

    sudo apt-get update
    sudo apt-get install pylucene

    # After installation, edit your ~/.bashrc file and add the following line at the end:
    export LD_LIBRARY_PATH=/usr/lib/jvm/java_folder_name/jre/lib/amd64/server

    # Example: export LD_LIBRARY_PATH=/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server

Note: Indexing can reduce the overall computation and execution time to one third of the original.
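To illustrate why an index saves time, here is a minimal sketch of the general idea: look up likely matches through an inverted index instead of comparing every record with every other record. It uses plain character trigrams from the standard library, not Lucene, and the function names (`trigrams`, `build_index`, `candidates`) and the `min_shared` parameter are hypothetical, not part of dupandas.

```python
from collections import defaultdict

def trigrams(text):
    """Return the set of character trigrams of a padded, lowercased string."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def build_index(records):
    """Map each trigram to the ids of the records that contain it."""
    index = defaultdict(set)
    for rec_id, text in records.items():
        for gram in trigrams(text):
            index[gram].add(rec_id)
    return index

def candidates(index, records, rec_id, min_shared=2):
    """Ids of other records sharing at least `min_shared` trigrams with rec_id."""
    counts = defaultdict(int)
    for gram in trigrams(records[rec_id]):
        for other in index[gram]:
            if other != rec_id:
                counts[other] += 1
    return {other for other, n in counts.items() if n >= min_shared}

# Toy data (made up for illustration)
records = {1: "new delhi", 2: "newdeli", 3: "mumbai", 4: "New Delhi"}
index = build_index(records)
print(candidates(index, records, 1))
```

Only records 2 and 4 come back as candidates for record 1, so the more expensive matcher never has to look at "mumbai".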

Usage: dupandas

Using dupandas with the default Matcher and Cleaner. The default Matcher is Exact Match and the default Cleaner is No Cleaning.

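    from dupandas import Dedupe

    # No configs passed: exact matching, no cleaning
    dupe = Dedupe()

    # pandas_dataframe is your own DataFrame; replace the placeholder
    # strings with the actual column names in your data
    input_config = {
        'input_data' : pandas_dataframe,
        'column' : 'column_name_to_deduplicate',
        '_id' : 'unique_id_column_of_dataset',
        }
    results = dupe.dedupe(input_config)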

dupandas using custom Cleaner and Matcher configs

    from dupandas import Dedupe

    clean_config = { 'lower' : True, 'punctuation' : True, 'whitespace' : True, 'digit' : True }
    match_config = { 'exact' : False, 'levenshtein' : True, 'soundex' : False, 'nysiis' : False}
    dupe = Dedupe(clean_config = clean_config, match_config = match_config)

    input_config = {
        'input_data' : pandas_dataframe,
        'column' : 'column_name_to_deduplicate',
        '_id' : 'unique_id_column_of_dataset',
        }
    results = dupe.dedupe(input_config)

Other options in input_config

    input_config = {
        'input_data' : pandas_dataframe,
        'column' : 'column_name_to_deduplicate',
        '_id' : 'unique_id_column_of_dataset',
        'score_column' : 'name_of_the_column_for_confidence_score',
        'threshold' : 0.75, # float value of threshold
        'unique_pairs' : True, # boolean to get unique (A=B) or duplicate (A=B and B=A) results
        'indexing' : False # Boolean to set lucene indexing = True / False, Default: False
        }
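The following sketch shows one way 'threshold' and 'unique_pairs' could act on scored pairs. It is an assumption-laden illustration, not dupandas code: the score comes from the standard library's `difflib.SequenceMatcher`, the function name `score_pairs` is hypothetical, and treating the threshold as inclusive (`>=`) is a guess.

```python
from difflib import SequenceMatcher
from itertools import permutations

def score_pairs(records, threshold=0.75, unique_pairs=True):
    """Return (id_a, id_b, score) for pairs scoring at or above threshold."""
    results = []
    for (id_a, text_a), (id_b, text_b) in permutations(records.items(), 2):
        if unique_pairs and id_a > id_b:
            continue  # keep A=B, skip the mirrored B=A
        score = SequenceMatcher(None, text_a, text_b).ratio()
        if score >= threshold:  # assumption: threshold is inclusive
            results.append((id_a, id_b, round(score, 2)))
    return results

# Toy data (made up for illustration)
records = {1: "new delhi", 2: "new delhi", 3: "mumbai"}
print(score_pairs(records, unique_pairs=True))   # one row for the pair (1, 2)
print(score_pairs(records, unique_pairs=False))  # rows for (1, 2) and (2, 1)
```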

Usage: standalone Cleaner class

    from dupandas import Cleaner
    clean_config = { 'lower' : True, 'punctuation' : True, 'whitespace' : True, 'digit' : True }
    clean = Cleaner(clean_config)
    clean.clean_text("new Delhi 3#! 34 ")
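For intuition about what these four options do to a string, here is a standard-library sketch. It is not the dupandas implementation: the standalone `clean_text` function below is hypothetical, and in particular it assumes 'whitespace' means collapsing runs of whitespace and trimming the ends.

```python
import re
import string

def clean_text(text, lower=True, punctuation=True, whitespace=True, digit=True):
    """Apply the enabled cleaning steps in a fixed order."""
    if lower:
        text = text.lower()
    if punctuation:
        # Drop every ASCII punctuation character
        text = text.translate(str.maketrans("", "", string.punctuation))
    if digit:
        text = re.sub(r"\d+", "", text)
    if whitespace:
        # Assumption: collapse repeated whitespace and trim the ends
        text = " ".join(text.split())
    return text

print(clean_text("new Delhi 3#! 34 "))  # prints: new delhi
```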

Usage: standalone Matcher class

    from dupandas import Matcher
    match_config = { 'exact' : False, 'levenshtein' : True, 'soundex' : False, 'nysiis' : False}
    match = Matcher(match_config)
    match.match_elements("new delhi", "newdeli")
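To show what a Levenshtein-based comparison measures on the same pair of strings, here is a pure-Python sketch. It is illustrative only: dupandas relies on the python-levenshtein package, and its scoring formula may differ from the simple length normalisation assumed here. The names `levenshtein` and `similarity` are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    """Normalise the distance to a 0..1 score; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

print(levenshtein("new delhi", "newdeli"))           # prints: 2
print(round(similarity("new delhi", "newdeli"), 2))  # prints: 0.78
```

Two deletions (the space and the "h") turn "new delhi" into "newdeli", which is why the pair scores just above a 0.75 threshold under this normalisation.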

Issues

Thanks for checking out this work. There is certainly scope for improvement, so feel free to submit issues and enhancement requests.

Contributing

ToDos

  1. [ ] V2: Add Support for multi column match
  2. [ ] V2: Add Matchers, Cleaners
  3. [ ] V2: Remove Library Dependencies
  4. [ ] V2: Handle Longer Texts, Command Line Arguments

Steps

  1. Fork the repo on GitHub
  2. Clone the project to your own machine
  3. Commit changes to your own branch
  4. Push your work back up to your fork
  5. Submit a Pull request