Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

Created with love in Canada, visit hostnodejs.com today

Feel like to post an Ad? Learn Details

All Projects → lyeoni → Prenlp

lyeoni / Prenlp

Licence: apache-2.0

Preprocessing Library for Natural Language Processing

Programming Languages

139335 projects - #7 most used programming language

Labels

nlp natural-language-processing text-processing

Projects that are alternatives of or similar to Prenlp

Text vectorization tool to outperform TFIDF for classification tasks

Stars: ✭ 167 (+28.46%)

Mutual labels: natural-language-processing, text-processing

🌿 An easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code.

Stars: ✭ 130 (+0%)

Mutual labels: natural-language-processing, text-processing

Stanford NLP group's shared Python tools.

Stars: ✭ 142 (+9.23%)

Mutual labels: natural-language-processing, text-processing

Mycroft's multilingual text parsing and formatting library

Stars: ✭ 51 (-60.77%)

Mutual labels: natural-language-processing, text-processing

PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language model. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).

Stars: ✭ 426 (+227.69%)

Mutual labels: natural-language-processing, text-processing

fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.

Stars: ✭ 2,441 (+1777.69%)

Mutual labels: natural-language-processing, text-processing

Python library for Natural Language Preprocessing (NLPre)

Stars: ✭ 158 (+21.54%)

Mutual labels: natural-language-processing, text-processing

THE String Processing Package for R (with ICU)

Stars: ✭ 204 (+56.92%)

Mutual labels: natural-language-processing, text-processing

Open Korean Text

Open Korean Text Processor - An Open-source Korean Text Processor

Stars: ✭ 438 (+236.92%)

Mutual labels: natural-language-processing, text-processing

CogComp's light-weight Python NLP annotators

Stars: ✭ 115 (-11.54%)

Mutual labels: natural-language-processing, text-processing

Awesome Hungarian Nlp

A curated list of NLP resources for Hungarian

Stars: ✭ 121 (-6.92%)

Mutual labels: natural-language-processing

A baseline implementation for FNC-1

Stars: ✭ 123 (-5.38%)

Mutual labels: natural-language-processing

Neuraldialog Larl

PyTorch implementation of latent space reinforcement learning for E2E dialog published at NAACL 2019. It is released by Tiancheng Zhao (Tony) from Dialog Research Center, LTI, CMU

Stars: ✭ 127 (-2.31%)

Mutual labels: natural-language-processing

A Golang library for processing Asciidoc files.

Stars: ✭ 129 (-0.77%)

Mutual labels: text-processing

Machine reading comprehension on clinical case reports

Stars: ✭ 123 (-5.38%)

Mutual labels: natural-language-processing

🔮 Neuro.js is machine learning library for building AI assistants and chat-bots (WIP).

Stars: ✭ 126 (-3.08%)

Mutual labels: natural-language-processing

🎀 JavaScript API for spaCy with Python REST API

Stars: ✭ 123 (-5.38%)

Mutual labels: natural-language-processing

Calculating ROUGE score between two files (line-by-line)

Stars: ✭ 120 (-7.69%)

Mutual labels: natural-language-processing

Nlp Pretrained Model

A collection of Natural language processing pre-trained models.

Stars: ✭ 122 (-6.15%)

Mutual labels: natural-language-processing

Medical Question Answering Dataset of 47,457 QA pairs created from 12 NIH websites

Stars: ✭ 129 (-0.77%)

Mutual labels: natural-language-processing

View All Similar Projects ➔

PreNLP

Preprocessing Library for Natural Language Processing

Installation

Requirements

Python >= 3.6

Mecab morphological analyzer for Korean

sh scripts/install_mecab.sh
# Only for Mac OS users, run the code below before run install_mecab.sh script.
# export MACOSX_DEPLOYMENT_TARGET=10.10
# CFLAGS='-stdlib=libc++' pip install konlpy

C++ Build tools for fastText
- g++ >= 4.7.2 or clang >= 3.3
- For Windows, Visual Studio C++ is recommended.

With pip

prenlp can be installed using pip as follows:

pip install prenlp

Usage

Data

Dataset Loading

Popular datasets for NLP tasks are provided in prenlp. All datasets is stored in /.data directory.

Sentiment Analysis: IMDb, NSMC
Language Modeling: WikiText-2, WikiText-103, WikiText-ko, NamuWiki-ko

Dataset	Language	Articles	Sentences	Tokens	Vocab	Size
WikiText-2	English	720	-	2,551,843	33,278	13.3MB
WikiText-103	English	28,595	-	103,690,236	267,735	517.4MB
WikiText-ko	Korean	477,946	2,333,930	131,184,780	662,949	667MB
NamuWiki-ko	Korean	661,032	16,288,639	715,535,778	1,130,008	3.3GB
WikiText-ko+NamuWiki-ko	Korean	1,138,978	18,622,569	846,720,558	1,360,538	3.95GB

General use cases are as follows:

WikiText-2 / WikiText-103

>>> wikitext2 = prenlp.data.WikiText2()
>>> len(wikitext2)
3
>>> train, valid, test = prenlp.data.WikiText2()
>>> train[0]
'= Valkyria Chronicles III ='

IMDB

>>> imdb_train, imdb_test = prenlp.data.IMDB()
>>> imdb_train[0]
["Minor Spoilers<br /><br />Alison Parker (Cristina Raines) is a successful top model, living with the lawyer Michael Lerman (Chris Sarandon) in his apartment. She tried to commit ...", 'pos']

Normalization

Frequently used normalization functions for text pre-processing are provided in prenlp.

url, HTML tag, emoticon, email, phone number, etc.

General use cases are as follows:

>>> from prenlp.data import Normalizer
>>> normalizer = Normalizer(url_repl='[URL]', tag_repl='[TAG]', emoji_repl='[EMOJI]', email_repl='[EMAIL]', tel_repl='[TEL]', image_repl='[IMG]')

>>> normalizer.normalize('Visit this link for more details: https://github.com/')
'Visit this link for more details: [URL]'

>>> normalizer.normalize('Use HTML with the desired attributes: <img src="cat.jpg" height="100" />')
'Use HTML with the desired attributes: [TAG]'

>>> normalizer.normalize('Hello 🤩, I love you 💓 !')
'Hello [EMOJI], I love you [EMOJI] !'

>>> normalizer.normalize('Contact me at [email protected]')
'Contact me at [EMAIL]'

>>> normalizer.normalize('Call +82 10-1234-5678')
'Call [TEL]'

>>> normalizer.normalize('Download our logo image, logo123.png, with transparent background.')
'Download our logo image, [IMG], with transparent background.'

Tokenizer

Frequently used (subword) tokenizers for text pre-processing are provided in prenlp.

SentencePiece, NLTKMosesTokenizer, Mecab

SentencePiece

>>> from prenlp.tokenizer import SentencePiece
>>> SentencePiece.train(input='corpus.txt', model_prefix='sentencepiece', vocab_size=10000)
>>> tokenizer = SentencePiece.load('sentencepiece.model')
>>> tokenizer('Time is the most valuable thing a man can spend.')
['▁Time', '▁is', '▁the', '▁most', '▁valuable', '▁thing', '▁a', '▁man', '▁can', '▁spend', '.']
>>> tokenizer.tokenize('Time is the most valuable thing a man can spend.')
['▁Time', '▁is', '▁the', '▁most', '▁valuable', '▁thing', '▁a', '▁man', '▁can', '▁spend', '.']
>>> tokenizer.detokenize(['▁Time', '▁is', '▁the', '▁most', '▁valuable', '▁thing', '▁a', '▁man', '▁can', '▁spend', '.'])
Time is the most valuable thing a man can spend.

Moses tokenizer

>>> from prenlp.tokenizer import NLTKMosesTokenizer
>>> tokenizer = NLTKMosesTokenizer()
>>> tokenizer('Time is the most valuable thing a man can spend.')
['Time', 'is', 'the', 'most', 'valuable', 'thing', 'a', 'man', 'can', 'spend', '.']

Comparisons with tokenizers on IMDb

Below figure shows the classification accuracy from various tokenizer.

Code: NLTKMosesTokenizer, SentencePiece

Comparisons with tokenizers on NSMC (Korean IMDb)

Below figure shows the classification accuracy from various tokenizer.

Code: Mecab, SentencePiece

Author

Hoyeon Lee @lyeoni
email : [email protected]
facebook : https://www.facebook.com/lyeoni.f

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].

Stars: ✭ 130

Visit Git Page 🔗Visit User Page 🔗Visit Issues Page (0) 🔗