
NIHOPA / Nlpre

Python library for Natural Language Preprocessing (NLPre)


Labels

nlp natural-language-processing text-processing

Projects that are alternatives of or similar to Nlpre

PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language models. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).

Stars: ✭ 426 (+169.62%)

Mutual labels: natural-language-processing, text-processing

THE String Processing Package for R (with ICU)

Stars: ✭ 204 (+29.11%)

Mutual labels: natural-language-processing, text-processing

Text vectorization tool to outperform TFIDF for classification tasks

Stars: ✭ 167 (+5.7%)

Mutual labels: natural-language-processing, text-processing

Preprocessing Library for Natural Language Processing

Stars: ✭ 130 (-17.72%)

Mutual labels: natural-language-processing, text-processing

CogComp's light-weight Python NLP annotators

Stars: ✭ 115 (-27.22%)

Mutual labels: natural-language-processing, text-processing

Open Korean Text

Open Korean Text Processor - An Open-source Korean Text Processor

Stars: ✭ 438 (+177.22%)

Mutual labels: natural-language-processing, text-processing

fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.

Stars: ✭ 2,441 (+1444.94%)

Mutual labels: natural-language-processing, text-processing

Mycroft's multilingual text parsing and formatting library

Stars: ✭ 51 (-67.72%)

Mutual labels: natural-language-processing, text-processing

🌿 An easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code.

Stars: ✭ 130 (-17.72%)

Mutual labels: natural-language-processing, text-processing

Stanford NLP group's shared Python tools.

Stars: ✭ 142 (-10.13%)

Mutual labels: natural-language-processing, text-processing

Chemdataextractor

Automatically extract chemical information from scientific documents

Stars: ✭ 152 (-3.8%)

Mutual labels: natural-language-processing

Paraphrase identification

Examine two sentences and determine whether they have the same meaning.

Stars: ✭ 154 (-2.53%)

Mutual labels: natural-language-processing

Speech signal processing and classification

Front-end speech processing aims at extracting proper features from short-term segments of a speech utterance, known as frames. It is a prerequisite step toward any pattern recognition problem employing speech or audio (e.g., music). Here, we are interested in voice disorder classification, that is, in developing two-class classifiers that can discriminate between utterances of a subject suffering from, say, vocal fold paralysis and utterances of a healthy subject. The mathematical modeling of the speech production system in humans suggests that an all-pole system function is justified [1-3]. As a consequence, linear prediction coefficients (LPCs) constitute a first choice for modeling the magnitude of the short-term spectrum of speech. LPC-derived cepstral coefficients are guaranteed to discriminate between the system (e.g., vocal tract) contribution and that of the excitation. Taking into account the characteristics of the human ear, the mel-frequency cepstral coefficients (MFCCs) emerged as descriptive features of the speech spectral envelope. Similarly to MFCCs, the perceptual linear prediction coefficients (PLPs) could also be derived. The aforementioned (so to speak) traditional features will be tested against agnostic features extracted by convolutional neural networks (CNNs) (e.g., auto-encoders) [4]. The pattern recognition step will be based on Gaussian Mixture Model based classifiers, K-nearest neighbor classifiers, Bayes classifiers, as well as Deep Neural Networks. The Massachusetts Eye and Ear Infirmary Dataset (MEEI-Dataset) [5] will be exploited. At the application level, a library for feature extraction and classification in Python will be developed. Credible publicly available resources, such as KALDI, will be used toward achieving our goal. Comparisons will be made against [6-8].

Stars: ✭ 155 (-1.9%)

Mutual labels: natural-language-processing

SLING - A natural language frame semantics parser

Stars: ✭ 1,892 (+1097.47%)

Mutual labels: natural-language-processing

A Library to parse natural language in pure Clojure and ClojureScript

Stars: ✭ 152 (-3.8%)

Mutual labels: natural-language-processing

Repository for paper "SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference"

Stars: ✭ 156 (-1.27%)

Mutual labels: natural-language-processing

Crf Layer On The Top Of Bilstm

The CRF Layer was implemented by using Chainer 2.0. Please see more details here: https://createmomo.github.io/2017/09/12/CRF_Layer_on_the_Top_of_BiLSTM_1/

Stars: ✭ 148 (-6.33%)

Mutual labels: natural-language-processing

Chinese Biomedical Language Understanding Evaluation benchmark (ChineseBLUE)

Stars: ✭ 149 (-5.7%)

Mutual labels: natural-language-processing

Finnlp Progress

NLP progress in Fintech. A repository to track the progress in Natural Language Processing (NLP) related to the domain of Finance, including the datasets, papers, and current state-of-the-art results for the most popular tasks.

Stars: ✭ 148 (-6.33%)

Mutual labels: natural-language-processing

📖 A curated list of resources dedicated to Natural Language Processing (NLP)

Stars: ✭ 12,626 (+7891.14%)

Mutual labels: natural-language-processing


Natural Language Preprocessing (NLPre)

Major version update! NLPre 2.0.0

Backend NLP engine pattern.en has been replaced with spaCy v2.1.0. This is a major fix for some of the problems with pattern.en, including poor lemmatization (e.g., cytokine -> cytocow).
Support for Python 2 has been dropped.
Support for custom dictionaries in replace_from_dictionary.
Option for a suffix to be used instead of a prefix in replace_from_dictionary.
URL replacement can now remove emails.
token_replacement can remove symbols.

NLPre is a text (pre)-processing library that helps smooth out some of the inconsistencies found in real-world data. Correcting for issues like random capitalization patterns, strange hyphenations, and abbreviations is an essential part of wrangling textual data, but it is often left to the user.

While this library was developed by the Office of Portfolio Analysis at the National Institutes of Health to correct for historical artifacts in our data, we envision this module addressing a broad spectrum of problems encountered in the preprocessing step of natural language processing.

NLPre is part of the word2vec-pipeline.

Installation

For the latest release, use

pip install nlpre

If installing the Python 3 version on Ubuntu, you may need to use

sudo apt-get install libmysqlclient-dev

Example

from nlpre import titlecaps, dedash, identify_parenthetical_phrases
from nlpre import replace_acronyms, replace_from_dictionary

text = ("LYMPHOMA SURVIVORS IN KOREA. Describe the correlates of unmet needs "
        "among non-Hodgkin lymphoma (NHL) surv- ivors in Korea and identify "
        "NHL patients with an abnormal white blood cell count.")

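# Collect counts of (phrase, abbreviation) pairs, e.g. "non-Hodgkin lymphoma (NHL)".
ABBR = identify_parenthetical_phrases()(text)

# Each parser is a callable; they are applied to the text in list order.
parsers = [dedash(), titlecaps(), replace_acronyms(ABBR),
           replace_from_dictionary(prefix="MeSH_")]

for f in parsers:
    text = f(text)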

print(text)

''' lymphoma survivors in korea .
    Describe the correlates of unmet needs among non_Hodgkin_lymphoma
    ( non_Hodgkin_lymphoma ) survivors in Korea and identify non_Hodgkin_lymphoma
    patients with an abnormal MeSH_Leukocyte_Count . '''

A longer example highlighting a "pipeline" of changes can be found here.
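
To make the acronym steps in the example above more concrete, here is a minimal, self-contained sketch of the general idea: find a capitalized abbreviation in parentheses, check that it matches the initials of the preceding words, and then expand later uses of it. This is not NLPre's implementation; the function names `find_parenthetical_acronyms` and `expand_acronyms` are hypothetical, and the sketch only rewrites the abbreviation itself, not the original phrase.

```python
import re
from collections import Counter


def find_parenthetical_acronyms(text):
    """Count (phrase, acronym) pairs such as 'non-Hodgkin lymphoma (NHL)'.

    Hypothetical helper for illustration only; not part of NLPre.
    """
    counts = Counter()
    # Look for a run of two or more capital letters inside parentheses.
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        # Words before the parenthesis; hyphens act as word separators.
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(acronym):]
        initials = "".join(word[0].upper() for word in candidate)
        # Keep the pair only if the initials spell out the acronym.
        if initials == acronym:
            counts[(tuple(candidate), acronym)] += 1
    return counts


def expand_acronyms(text, counts):
    """Replace each known acronym with its phrase joined by underscores."""
    for (phrase, acronym), _ in counts.most_common():
        text = re.sub(rf"\b{acronym}\b", "_".join(phrase), text)
    return text


counts = find_parenthetical_acronyms(
    "unmet needs among non-Hodgkin lymphoma (NHL) survivors"
)
print(expand_acronyms("identify NHL patients", counts))
# prints: identify non_Hodgkin_lymphoma patients
```

Returning a `Counter` keyed by `(phrase, acronym)` mirrors the shape described for `identify_parenthetical_phrases` in the table below, which is what lets the result feed directly into a replacement step.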

To see a detailed log of the changes made, set the logger level to logging.INFO or logging.DEBUG:

import nlpre, logging
nlpre.logger.setLevel(logging.INFO)
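
For readers less familiar with Python's `logging` module, the following standalone sketch shows how the level setting controls which messages are emitted. It uses only the standard library; the logger name `preprocessing_demo`, the `ListHandler` class, and the messages are made up for the demonstration and are not NLPre's actual logger or log output.

```python
import logging

# Hypothetical logger name, used only for this demonstration.
logger = logging.getLogger("preprocessing_demo")


class ListHandler(logging.Handler):
    """Collect formatted log messages in a list so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname}: {record.getMessage()}")


handler = ListHandler()
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.setLevel(logging.INFO)
logger.info("example change message")    # recorded: INFO passes the threshold
logger.debug("example detail message")   # dropped: DEBUG is below INFO

logger.setLevel(logging.DEBUG)
logger.debug("example detail message")   # recorded now that DEBUG is enabled

print(handler.messages)
```

In short, `logging.INFO` shows the higher-level messages, while `logging.DEBUG` additionally shows the finer-grained ones.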

What's included?

Function	Description
replace_from_dictionary	Replaces phrases from an input dictionary. The replacement is done without regard to case, but punctuation is handled correctly. The MeSH (Medical Subject Headings) dictionary is built in. Example: `(11-Dimethylethyl)-4-methoxyphenol is great` → `MeSH_Butylated_Hydroxyanisole is great`
replace_acronyms	Replaces acronyms and abbreviations found in a document with their corresponding phrase. If an acronym is explicitly identified with a phrase in a document, then all instances of that acronym in the document are replaced with the given phrase. If the document gives no explicit indication of what the phrase is, then the most common phrase associated with the acronym in the given counter is used. Example: `The EPA protects trees` → `The Environmental_Protection_Agency protects trees`
identify_parenthetical_phrases	Identifies abbreviations of phrases found in parentheses. Returns a counter that can be passed directly into `replace_acronyms`. Example: `Environmental Protection Agency (EPA)` → `Counter({(('Environmental', 'Protection', 'Agency'), 'EPA'): 1})`
separated_parenthesis	Separates parenthetical content into new sentences. This is useful when creating word embeddings, as associations should only be made within the same sentence. A terminal period is added to parenthetical sentences if necessary. Example: `Hello (it is a beautiful day) world.` → `Hello world. it is a beautiful day .`
pos_tokenizer	Removes all words that are of a designated part-of-speech (POS) from a document. For example, when processing medical text, it is useful to remove all words that are not nouns or adjectives. POS detection is provided by the `spaCy` module. Example: `The boy threw the ball into the yard` → `boy ball yard`
unidecoder	Converts Unicode phrases into their ASCII equivalents. Example: `α-Helix β-sheet` → `a-Helix b-sheet`
dedash	Hyphenations are sometimes erroneously inserted when text is passed through a word processor. This module attempts to correct the hyphenation pattern by joining the two fragments if the joined word appears in an English word list. Example: `How is the treat- ment going` → `How is the treatment going`
decaps_text	We presume that case is important, but only when it differs from title case. This class normalizes capitalization patterns. Example: `James and Sally had a fMRI` → `james and sally had a fMRI`
titlecaps	Documents sometimes have sentences that are entirely in uppercase (commonly found in titles and abstracts of older documents). This parser identifies sentences where every word is uppercase, and returns the document with these sentences converted to lowercase. Example: `ON THE STRUCTURE OF WATER.` → `On the structure of water .`
token_replacement	Simple token replacement. Example: `Observed > 20%` → `Observed greater-than 20 percent`
separate_reference	Separates and optionally removes references that have been concatenated onto words. Example: `Key feature of interleukin-1 in Drosophila3-5 and elegans(7).` → `Key feature of interleukin-1 in Drosophila and elegans .`
url_replacement	Removes or replaces URLs. Example: `The source code is [here](www.github.com/NIHOPA/NLPre/).` → `The source code is [here](LINK).`
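
As an illustration of the word-list check described for `dedash`, the sketch below joins a `fragment- fragment` pair only when the joined word appears in a supplied vocabulary. It is a simplified stand-in written for this document, not NLPre's code: the function name `dedash_sketch` and the tiny vocabulary are assumptions, and a real word list would be far larger.

```python
import re


def dedash_sketch(text, vocabulary):
    """Join 'treat- ment' style splits when the joined word is known.

    `vocabulary` is a set of lowercase words (hypothetical word list).
    """

    def join_if_known(match):
        joined = match.group(1) + match.group(2)
        # Only merge when the result is a recognized word; otherwise
        # leave the original text untouched.
        if joined.lower() in vocabulary:
            return joined
        return match.group(0)

    # A word, a hyphen, one space, then another word.
    return re.sub(r"(\w+)- (\w+)", join_if_known, text)


vocabulary = {"treatment", "survivors"}
print(dedash_sketch("How is the treat- ment going", vocabulary))
# prints: How is the treatment going
```

Checking against a word list is what keeps legitimate constructions such as `pre- and post-test` intact, since `preand` is not a word.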

Citations and Acknowledgments

He, Jian and Chaomei Chen. Predictive Effects of Novelty Measured by Temporal Embeddings on the Growth of Scientific Literature. Frontiers in Research Metrics and Analytics, 3, 9. (2018).
He, Jian and Chaomei Chen. Temporal Representations of Citations for Understanding the Changing Roles of Scientific Publications. Front. Res. Metr. Anal. (2018).
Galea, Dieter et al. Sub-word information in pre-trained biomedical word representations: evaluation and hyper-parameter optimization. BioNLP (2018).

Contributors

License

This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.


Stars: ✭ 158
