prasanthg3 / cleantext

Licence: MIT license
An open-source package for python to clean raw text data

cleantext

cleantext is an open-source Python package for cleaning raw text data. Source code for the library can be found here.

Features

cleantext has two main methods:

  • clean: to clean raw text and return the cleaned text
  • clean_words: to clean raw text and return a list of clean words

cleantext can apply all, or a selected combination of the following cleaning operations:

  • Remove extra white spaces
  • Convert the entire text to lowercase
  • Remove digits from the text
  • Remove punctuation from the text
  • Remove or replace parts of the text that match a custom regex
  • Remove stop words, and choose a language for stop words (stop words are generally the most common words in a language that carry little meaning on their own, such as is, am, the, this, are, etc.)
  • Stem the words (stemming reduces inflected forms of a word to a common stem; for example, stemming the words run, runs, running results in run, run, run)
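
The simpler operations in the list above can be approximated with the standard library alone. The following is a minimal sketch under stated assumptions, not cleantext's actual implementation: the function name `basic_clean` and the tiny `STOP_WORDS` set are hypothetical and exist only for illustration, and stemming is left out because it needs a real stemmer such as the ones NLTK provides.

```python
import re
import string

# Hypothetical, hand-picked stop words; a real list would be far longer.
STOP_WORDS = {"is", "am", "the", "this", "are", "a", "to"}


def basic_clean(text, lowercase=True, numbers=True, punct=True,
                extra_spaces=True, stopwords=False):
    """Apply a selected combination of simple cleaning steps to text."""
    if lowercase:
        text = text.lower()
    if numbers:
        # Drop every run of digits.
        text = re.sub(r"\d+", "", text)
    if punct:
        # Delete all ASCII punctuation characters.
        text = text.translate(str.maketrans("", "", string.punctuation))
    if stopwords:
        # Keep only the words that are not in the stop-word set.
        text = " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)
    if extra_spaces:
        # Collapse runs of whitespace and trim both ends.
        text = " ".join(text.split())
    return text


raw = "This is A s$ample !!!! tExt3% to   cleaN566556+2+59*/133"
print(basic_clean(raw))                  # this is a sample text to clean
print(basic_clean(raw, stopwords=True))  # sample text clean
```

Each step is controlled by its own flag, which mirrors the idea of applying all, or only a selected combination, of the operations.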

Installation

cleantext requires Python 3 and NLTK to execute.

To install using pip, run:

pip install cleantext
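
After installing, a quick way to confirm that both requirements are available is to ask Python whether the modules can be found. This is a small sketch using only the standard library; the helper name `missing_modules` is hypothetical, and it assumes the packages are importable as `cleantext` and `nltk`.

```python
import importlib.util


def missing_modules(names=("cleantext", "nltk")):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("cleantext and NLTK are importable")
```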

Usage

  • Import the library:
import cleantext
  • Choose a method:

To return the cleaned text as a string:

cleantext.clean("your_raw_text_here")

To return a list of cleaned words from the text:

cleantext.clean_words("your_raw_text_here")

To choose a specific set of cleaning operations:

cleantext.clean_words("your_raw_text_here",
    clean_all=False,  # If True, execute all cleaning operations
    extra_spaces=True,  # Remove extra white spaces
    stemming=True,  # Stem the words
    stopwords=True,  # Remove stop words
    lowercase=True,  # Convert to lowercase
    numbers=True,  # Remove all digits
    punct=True,  # Remove all punctuation
    reg='<regex>',  # Remove parts of text based on regex
    reg_replace='<replace_value>',  # String to replace the regex used in reg
    stp_lang='english'  # Language for stop words
)

Examples

import cleantext
cleantext.clean('This is A s$ample !!!! tExt3% to   cleaN566556+2+59*/133', extra_spaces=True, lowercase=True, numbers=True, punct=True)

returns:

'this is a sample text to clean'

import cleantext
cleantext.clean_words('This is A s$ample !!!! tExt3% to   cleaN566556+2+59*/133')

returns:

['sampl', 'text', 'clean']

from cleantext import clean
text = "my id, [email protected] and your, [email protected]"
clean(text, reg=r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", reg_replace='email', clean_all=False)

returns:

"my id, email and your, email"

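The regex-based replacement in the last example behaves much like a plain substitution with Python's `re` module. The sketch below illustrates that idea and is not a description of cleantext's internals: the helper `replace_matches` is hypothetical, and the two addresses are made-up placeholders chosen so that the pattern from the example matches them.

```python
import re

# Pattern copied from the example above.
EMAIL_PATTERN = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"


def replace_matches(text, pattern, replacement=""):
    """Replace every match of pattern in text with replacement."""
    return re.sub(pattern, replacement, text)


# Placeholder addresses, made up for this sketch.
text = "my id, alice@example.com and your, bob@example.org"
print(replace_matches(text, EMAIL_PATTERN, "email"))
# my id, email and your, email
```

Passing an empty replacement removes the matched parts instead of substituting them.
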
License

MIT

For any questions, issues, bugs, or suggestions, please visit here.