All Projects → lyeoni → Prenlp

lyeoni / Prenlp

Licence: apache-2.0
Preprocessing Library for Natural Language Processing

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Prenlp

Textvec
Text vectorization tool to outperform TFIDF for classification tasks
Stars: ✭ 167 (+28.46%)
Mutual labels:  natural-language-processing, text-processing
Konoha
🌿 An easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code.
Stars: ✭ 130 (+0%)
Mutual labels:  natural-language-processing, text-processing
Stanza Old
Stanford NLP group's shared Python tools.
Stars: ✭ 142 (+9.23%)
Mutual labels:  natural-language-processing, text-processing
Lingua Franca
Mycroft's multilingual text parsing and formatting library
Stars: ✭ 51 (-60.77%)
Mutual labels:  natural-language-processing, text-processing
Pynlpl
PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language model. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).
Stars: ✭ 426 (+227.69%)
Mutual labels:  natural-language-processing, text-processing
Fastnlp
fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.
Stars: ✭ 2,441 (+1777.69%)
Mutual labels:  natural-language-processing, text-processing
Nlpre
Python library for Natural Language Preprocessing (NLPre)
Stars: ✭ 158 (+21.54%)
Mutual labels:  natural-language-processing, text-processing
Stringi
THE String Processing Package for R (with ICU)
Stars: ✭ 204 (+56.92%)
Mutual labels:  natural-language-processing, text-processing
Open Korean Text
Open Korean Text Processor - An Open-source Korean Text Processor
Stars: ✭ 438 (+236.92%)
Mutual labels:  natural-language-processing, text-processing
Cogcomp Nlpy
CogComp's light-weight Python NLP annotators
Stars: ✭ 115 (-11.54%)
Mutual labels:  natural-language-processing, text-processing
Awesome Hungarian Nlp
A curated list of NLP resources for Hungarian
Stars: ✭ 121 (-6.92%)
Mutual labels:  natural-language-processing
Fnc 1 Baseline
A baseline implementation for FNC-1
Stars: ✭ 123 (-5.38%)
Mutual labels:  natural-language-processing
Neuraldialog Larl
PyTorch implementation of latent space reinforcement learning for E2E dialog published at NAACL 2019. It is released by Tiancheng Zhao (Tony) from Dialog Research Center, LTI, CMU
Stars: ✭ 127 (-2.31%)
Mutual labels:  natural-language-processing
Libasciidoc
A Golang library for processing Asciidoc files.
Stars: ✭ 129 (-0.77%)
Mutual labels:  text-processing
Clicr
Machine reading comprehension on clinical case reports
Stars: ✭ 123 (-5.38%)
Mutual labels:  natural-language-processing
Neuro
🔮 Neuro.js is machine learning library for building AI assistants and chat-bots (WIP).
Stars: ✭ 126 (-3.08%)
Mutual labels:  natural-language-processing
Spacy Js
🎀 JavaScript API for spaCy with Python REST API
Stars: ✭ 123 (-5.38%)
Mutual labels:  natural-language-processing
Files2rouge
Calculating ROUGE score between two files (line-by-line)
Stars: ✭ 120 (-7.69%)
Mutual labels:  natural-language-processing
Nlp Pretrained Model
A collection of Natural language processing pre-trained models.
Stars: ✭ 122 (-6.15%)
Mutual labels:  natural-language-processing
Medquad
Medical Question Answering Dataset of 47,457 QA pairs created from 12 NIH websites
Stars: ✭ 129 (-0.77%)
Mutual labels:  natural-language-processing

PreNLP

PyPI License GitHub stars GitHub forks

Preprocessing Library for Natural Language Processing

Installation

Requirements

  • Python >= 3.6
  • Mecab morphological analyzer for Korean
    sh scripts/install_mecab.sh
    # Only for Mac OS users, run the code below before run install_mecab.sh script.
    # export MACOSX_DEPLOYMENT_TARGET=10.10
    # CFLAGS='-stdlib=libc++' pip install konlpy
    
  • C++ Build tools for fastText

With pip

prenlp can be installed using pip as follows:

pip install prenlp

Usage

Data

Dataset Loading

Popular datasets for NLP tasks are provided in prenlp. All datasets is stored in /.data directory.

  • Sentiment Analysis: IMDb, NSMC
  • Language Modeling: WikiText-2, WikiText-103, WikiText-ko, NamuWiki-ko
Dataset Language Articles Sentences Tokens Vocab Size
WikiText-2 English 720 - 2,551,843 33,278 13.3MB
WikiText-103 English 28,595 - 103,690,236 267,735 517.4MB
WikiText-ko Korean 477,946 2,333,930 131,184,780 662,949 667MB
NamuWiki-ko Korean 661,032 16,288,639 715,535,778 1,130,008 3.3GB
WikiText-ko+NamuWiki-ko Korean 1,138,978 18,622,569 846,720,558 1,360,538 3.95GB

General use cases are as follows:

WikiText-2 / WikiText-103
>>> wikitext2 = prenlp.data.WikiText2()
>>> len(wikitext2)
3
>>> train, valid, test = prenlp.data.WikiText2()
>>> train[0]
'= Valkyria Chronicles III ='
IMDB
>>> imdb_train, imdb_test = prenlp.data.IMDB()
>>> imdb_train[0]
["Minor Spoilers<br /><br />Alison Parker (Cristina Raines) is a successful top model, living with the lawyer Michael Lerman (Chris Sarandon) in his apartment. She tried to commit ...", 'pos']

Normalization

Frequently used normalization functions for text pre-processing are provided in prenlp.

url, HTML tag, emoticon, email, phone number, etc.

General use cases are as follows:

>>> from prenlp.data import Normalizer
>>> normalizer = Normalizer(url_repl='[URL]', tag_repl='[TAG]', emoji_repl='[EMOJI]', email_repl='[EMAIL]', tel_repl='[TEL]', image_repl='[IMG]')

>>> normalizer.normalize('Visit this link for more details: https://github.com/')
'Visit this link for more details: [URL]'

>>> normalizer.normalize('Use HTML with the desired attributes: <img src="cat.jpg" height="100" />')
'Use HTML with the desired attributes: [TAG]'

>>> normalizer.normalize('Hello 🤩, I love you 💓 !')
'Hello [EMOJI], I love you [EMOJI] !'

>>> normalizer.normalize('Contact me at [email protected]')
'Contact me at [EMAIL]'

>>> normalizer.normalize('Call +82 10-1234-5678')
'Call [TEL]'

>>> normalizer.normalize('Download our logo image, logo123.png, with transparent background.')
'Download our logo image, [IMG], with transparent background.'

Tokenizer

Frequently used (subword) tokenizers for text pre-processing are provided in prenlp.

SentencePiece, NLTKMosesTokenizer, Mecab

SentencePiece

>>> from prenlp.tokenizer import SentencePiece
>>> SentencePiece.train(input='corpus.txt', model_prefix='sentencepiece', vocab_size=10000)
>>> tokenizer = SentencePiece.load('sentencepiece.model')
>>> tokenizer('Time is the most valuable thing a man can spend.')
['▁Time', '▁is', '▁the', '▁most', '▁valuable', '▁thing', '▁a', '▁man', '▁can', '▁spend', '.']
>>> tokenizer.tokenize('Time is the most valuable thing a man can spend.')
['▁Time', '▁is', '▁the', '▁most', '▁valuable', '▁thing', '▁a', '▁man', '▁can', '▁spend', '.']
>>> tokenizer.detokenize(['▁Time', '▁is', '▁the', '▁most', '▁valuable', '▁thing', '▁a', '▁man', '▁can', '▁spend', '.'])
Time is the most valuable thing a man can spend.

Moses tokenizer

>>> from prenlp.tokenizer import NLTKMosesTokenizer
>>> tokenizer = NLTKMosesTokenizer()
>>> tokenizer('Time is the most valuable thing a man can spend.')
['Time', 'is', 'the', 'most', 'valuable', 'thing', 'a', 'man', 'can', 'spend', '.']

Comparisons with tokenizers on IMDb

Below figure shows the classification accuracy from various tokenizer.

Comparisons with tokenizers on NSMC (Korean IMDb)

Below figure shows the classification accuracy from various tokenizer.

Author

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].