iaramer / dobbi

License: Apache-2.0
An open-source NLP library: fast text cleaning and preprocessing

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to dobbi

html-comment-regex
Regular expression for matching HTML comments
Stars: ✭ 15 (-28.57%)
Mutual labels:  string, text, regexp
stringx
Drop-in replacements for base R string functions powered by stringi
Stars: ✭ 14 (-33.33%)
Mutual labels:  text, regexp
textics
📉 JavaScript Text Statistics that counts lines, words, chars, and spaces.
Stars: ✭ 36 (+71.43%)
Mutual labels:  string, text
Stringi
THE String Processing Package for R (with ICU)
Stars: ✭ 204 (+871.43%)
Mutual labels:  text, regexp
regXwild
⏱ Superfast ^Advanced wildcards++? | Unique algorithms that was implemented on native unmanaged C++ but easily accessible in .NET via Conari (with caching of 0x29 opcodes +optimizations) etc.
Stars: ✭ 20 (-4.76%)
Mutual labels:  text, regexp
Prestyler
Elegant text formatting tool in Swift 🔥
Stars: ✭ 36 (+71.43%)
Mutual labels:  string, text
subst
Search and des... argh... replace in many files at once. Use regexp and power of Python to replace what you want.
Stars: ✭ 20 (-4.76%)
Mutual labels:  text, regexp
String.prototype.matchAll
Spec-compliant polyfill for String.prototype.matchAll, in ES2020
Stars: ✭ 14 (-33.33%)
Mutual labels:  string, regexp
Attributedstring
Elegantly build rich text with Swift string interpolation; supports tap and long-press events, filtering by type, custom views, and more.
Stars: ✭ 294 (+1300%)
Mutual labels:  string, text
justified
Wrap, align and justify the words in a string.
Stars: ✭ 30 (+42.86%)
Mutual labels:  string, text
TairString
A redis module, similar to redis string, but you can set expire and version for the value. It also provides many very useful commands, such as cas/cad, etc.
Stars: ✭ 99 (+371.43%)
Mutual labels:  string
mqtt-match
Match mqtt formatted topic strings
Stars: ✭ 19 (-9.52%)
Mutual labels:  regexp
DataTypes
Built-in data types
Stars: ✭ 34 (+61.9%)
Mutual labels:  string
node-red-contrib-string
Provides a string manipulation node with a chainable UI based on the concise and lightweight stringjs.com.
Stars: ✭ 15 (-28.57%)
Mutual labels:  string
parse-author
Parse a person, author, contributor or maintainer string into an object with name, email and url properties following NPM conventions. Useful for the `authors` property in package.json or for parsing an AUTHORS file into an array of person objects.
Stars: ✭ 23 (+9.52%)
Mutual labels:  string
carbon-preprocess-svelte
Collection of Svelte preprocessors for the Carbon Design System
Stars: ✭ 39 (+85.71%)
Mutual labels:  preprocess
vesdk-android-demo
VideoEditor SDK: A fully customizable video editor for your app.
Stars: ✭ 90 (+328.57%)
Mutual labels:  text
selecton-extension
Selecton provides popup with actions on text selection in all major browsers
Stars: ✭ 36 (+71.43%)
Mutual labels:  text
Last-Launcher
Lightweight: Faster than light, Low on memory
Stars: ✭ 148 (+604.76%)
Mutual labels:  text
probabilistic nlg
Tensorflow Implementation of Stochastic Wasserstein Autoencoder for Probabilistic Sentence Generation (NAACL 2019).
Stars: ✭ 28 (+33.33%)
Mutual labels:  text

🌴 dobbi 🦕

Takes care of all of this boring NLP stuff

Description

An open-source NLP library: fast text cleaning and preprocessing.

TL;DR

This library provides quick, ready-to-use tools for text cleaning and normalization. You can easily remove hashtags, nicknames, emoji, URLs, punctuation, whitespace, and more.

Installation

To install dobbi, either clone this GitHub repo or install it from PyPI via pip:

$ pip install dobbi

Usage

Import the library:

import dobbi

Interaction

The library uses method chaining in order to simplify text processing:

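import dobbi
import pandas as pd

# Two sample messages containing hashtags, a nickname, a URL, an emoticon and an emoji
d = {'text': ['#fun #lol   Why  @Alex33 is so funny here: https://some-url.com',
              '#looool     =)      😍 such lovely!?*!!!%&']}
df = pd.DataFrame(d)

# First pass: build a function that removes hashtags, nicknames and URLs
cln_func = dobbi.clean() \
    .hashtag() \
    .nickname() \
    .url() \
    .function()
df['text'] = df['text'].map(cln_func)

# Second pass: replace emoji and emoticons with tokens and process punctuation
repl_func = dobbi.replace() \
    .emoji() \
    .emoticon() \
    .punctuation() \
    .function()
df['text'] = df['text'].map(repl_func)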

Result:

print(df['text'][0])  # 'Why is so funny here'
print(df['text'][1])  # 'TOKEN_EMOTICON_HAPPY_FACE_OR_SMILEY TOKEN_EMOJI_SMILING_FACE_WITH_HEART_EYES such lovely'

Supported methods and patterns

The process consists of three stages:

  1. Initialization methods: initialize a dobbi Work object
  2. Intermediate methods: chain patterns in the needed order
  3. Terminal methods: choose if you need a function or a result

Initialization functions:

  • dobbi.clean()
  • dobbi.collect()
  • dobbi.replace()

Intermediate methods (pattern processing choice):

  • regexp() - custom regular expressions
  • url() - URLs
  • html() - HTML and other "<...>"-style markup
  • punctuation() - punctuation
  • hashtag() - hashtags
  • emoji() - emoji
  • emoticon() - emoticons
  • whitespace() - any kind of whitespace
  • nickname() - @-starting nicknames

Terminal methods:

  • execute(str) - executes chosen methods on the provided string.
  • function() - returns a function which is a combination of the chosen methods.
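
The three stages above can be illustrated with a small stand-alone sketch. This is not dobbi's implementation: the class name ToyChain, the three regular expressions, and the whitespace squeezing at the end are all assumptions made for illustration, built only on Python's standard re module. It shows how intermediate methods can return the same object so that calls chain, and how the two terminal methods differ.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical patterns for this sketch; dobbi's real patterns may differ.
HASHTAG = r'#\w+'
NICKNAME = r'@\w+'
URL = r'https?://\S+'


class ToyChain:
    """Toy three-stage chain: initialize, add patterns, then terminate."""

    def __init__(self) -> None:
        # Each step is a (pattern, replacement) pair, kept in chaining order.
        self._steps: List[Tuple[str, str]] = []

    def regexp(self, pattern: str, repl: str = '') -> 'ToyChain':
        # Intermediate method: record the step and return self so calls chain.
        self._steps.append((pattern, repl))
        return self

    def hashtag(self, repl: str = '') -> 'ToyChain':
        return self.regexp(HASHTAG, repl)

    def nickname(self, repl: str = '') -> 'ToyChain':
        return self.regexp(NICKNAME, repl)

    def url(self, repl: str = '') -> 'ToyChain':
        return self.regexp(URL, repl)

    def execute(self, text: str) -> str:
        # Terminal method: apply every step in the order it was chained.
        for pattern, repl in self._steps:
            text = re.sub(pattern, repl, text)
        # Collapse the whitespace left behind by removed matches.
        return ' '.join(text.split())

    def function(self) -> Callable[[str], str]:
        # Terminal method: return a reusable callable instead of a result.
        return self.execute


text = '#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com'

# An empty replacement removes the match; a token replaces it.
cleaned = ToyChain().hashtag().nickname().url().execute(text)
replaced = (ToyChain()
            .hashtag()
            .nickname('TOKEN_NICKNAME')
            .url('__CUSTOM_URL_TOKEN__')
            .execute(text))
cleaner = ToyChain().hashtag().function()

print(cleaned)
print(replaced)
print(cleaner('#a b'))
```

Returning a callable from function() is what makes a chain convenient to pass to something like pandas' Series.map, while execute() suits one-off strings.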

Examples

1) Clean a random Twitter message

dobbi.clean() \
    .hashtag() \
    .nickname() \
    .url() \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why is so funny? Check here:'

2) Replace nicknames and urls with tokens

dobbi.replace() \
    .hashtag('') \
    .nickname() \
    .url('__CUSTOM_URL_TOKEN__') \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why TOKEN_NICKNAME is so funny? Check here: __CUSTOM_URL_TOKEN__'

3) Get the text cleanup function

func = dobbi.clean() \
    .url() \
    .hashtag() \
    .punctuation() \
    .whitespace() \
    .html() \
    .function()
func('\t #fun #lol    Why  @Alex33 is so... funny? <tag> \nCheck\there: https://some-url.com')

Result:

'Why Alex33 is so funny Check here'

4) Chain regexp methods

dobbi.clean() \
    .regexp(r'#\w+') \
    .regexp(r'@\w+') \
    .regexp(r'https?://\S+') \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why is so funny? Check here:'
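
Before chaining custom patterns, it can help to check what each one matches. The sketch below uses only the standard re module, not dobbi itself; the dictionary keys are labels chosen for this example. The patterns are written as raw strings so that backslash sequences such as \w and \S reach the regex engine unchanged.

```python
import re

text = '#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com'

# The same three patterns as in the example, written as raw strings.
patterns = {
    'hashtag': r'#\w+',
    'nickname': r'@\w+',
    'url': r'https?://\S+',
}

# Collect every non-overlapping match of each pattern in the sample text.
matches = {name: re.findall(pattern, text) for name, pattern in patterns.items()}

for name, found in matches.items():
    print(name, found)
# hashtag ['#fun', '#lol']
# nickname ['@Alex33']
# url ['https://some-url.com']
```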

5) Remove emoji and emoticons

em_func = dobbi.clean() \
    .emoji() \
    .emoticon() \
    .punctuation() \
    .function()
em_func('Great! =) :D  😍 😋such lovely!?*!!!%&')

Result:

'Great such lovely'

Additional

Note that the functions are applied in the order in which you chain them. Removing punctuation early can strip characters such as # and @ that other patterns rely on, so it is best to chain .punctuation() as one of the last functions.
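
The effect of ordering can be seen in a small stand-alone sketch. It uses plain re rather than dobbi, and both patterns and the squeeze helper are assumptions made for illustration, so dobbi's own results may differ in detail.

```python
import re
import string

# Illustrative patterns, assumed for this sketch; not taken from dobbi's source.
HASHTAG = re.compile(r'#\w+')
PUNCTUATION = re.compile('[' + re.escape(string.punctuation) + ']')


def squeeze(text: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return ' '.join(text.split())


text = '#fun Why is it so funny?'

# Hashtags first: '#fun' is still intact, so the whole tag is removed.
hashtag_first = squeeze(PUNCTUATION.sub('', HASHTAG.sub('', text)))

# Punctuation first: '#' is stripped, so the hashtag pattern no longer
# matches and the word 'fun' is left behind.
punctuation_first = squeeze(HASHTAG.sub('', PUNCTUATION.sub('', text)))

print(hashtag_first)      # Why is it so funny
print(punctuation_first)  # fun Why is it so funny
```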

Call for collaboration 🤗

If you enjoyed the project I would be grateful if you supported it :)

Below is a list of contributions I would welcome:

  • Finding bugs
  • Making code optimizations
  • Writing tests
  • Helping with new feature development