bjascob / Lemminflect

License: MIT
A python module for English lemmatization and inflection.

LemmInflect

A python module for English lemmatization and inflection.

About

LemmInflect uses a dictionary approach to lemmatize English words and inflect them into forms specified by a user-supplied Universal Dependencies or Penn Treebank tag. The library handles out-of-vocabulary (OOV) words by applying neural network techniques to classify word forms and choose the appropriate morphing rules.

The system acts as a standalone module or as an extension to the spaCy NLP system.

The dictionary and morphology rules are derived from the NIH's SPECIALIST Lexicon, which contains an extensive set of information on English word forms.

A simpler, inflection-only system is available as pyInflect. LemmInflect was created to address some of the shortcomings of that project and to add features, such as:

  • Independence from the spaCy lemmatizer
  • Neural nets to disambiguate out-of-vocabulary morphology
  • Unigrams to disambiguate spellings and multiple word forms

Documentation

For the latest documentation, see ReadTheDocs.

Accuracy of the Lemmatizer

The accuracy of LemmInflect and several other popular NLP utilities was tested using the Automatically Generated Inflection Database (AGID) as a baseline. The AGID has an extensive list of lemmas and their corresponding inflections. Each inflection was lemmatized by the test software and then compared to the original value in the corpus. The test included 119,194 different inflected words.

| Package          | Verb  |  Noun | ADJ/ADV | Overall |  Speed  |
|----------------------------------------------------------------|
| LemmInflect      | 96.1% | 95.4% |  93.9%  |  95.6%  | 42.0 µs |
| CLiPS/pattern.en | 93.6% | 91.1% |   0.0%  |  n/a    |  3.0 µs |
| Stanford CoreNLP | 87.6% | 93.1% |   0.0%  |  n/a    |  n/a    |
| spaCy            | 79.4% | 88.9% |  60.5%  |  84.7%  |  5.0 µs |
| NLTK             | 53.3% | 52.2% |  53.3%  |  52.6%  | 13.0 µs |
|----------------------------------------------------------------|
  • Speed is in microseconds per lemma, measured on an i9-7940X CPU.
  • The Stanford and CLiPS lemmatizers don't accept part-of-speech information. In the case of pattern.en, the method was set up specifically for verbs, not as a lemmatizer for all word types.
  • The Stanford CoreNLP lemmatizer is a Java package and testing was done via the built-in HTTP server, so its speed measurement is not valid.
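
The test procedure described above (lemmatize each inflection, then compare the result with the reference lemma) can be sketched roughly as follows. This is a minimal illustration, not the project's actual test script: `lemmatizer_accuracy`, `naive_lemmatize` and the four-word sample are all hypothetical stand-ins, and the real test used the AGID word list.

```python
from typing import Callable, Iterable, Tuple


def lemmatizer_accuracy(
    pairs: Iterable[Tuple[str, str]],
    lemmatize: Callable[[str], str],
) -> float:
    """Return the fraction of (inflection, lemma) pairs lemmatized correctly."""
    total = 0
    correct = 0
    for inflection, expected_lemma in pairs:
        total += 1
        if lemmatize(inflection) == expected_lemma:
            correct += 1
    return correct / total if total else 0.0


def naive_lemmatize(word: str) -> str:
    """Hypothetical lemmatizer under test: strips a single trailing "s"."""
    return word[:-1] if word.endswith("s") else word


# Hypothetical sample of (inflection, reference lemma) pairs.
sample = [("watches", "watch"), ("cats", "cat"), ("runs", "run"), ("mice", "mouse")]
accuracy = lemmatizer_accuracy(sample, naive_lemmatize)
print(f"{accuracy:.1%}")
```

Any callable that maps a word to a single lemma can be passed as `lemmatize`, which is how several packages could be compared against the same reference list.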

Requirements and Installation

The only external requirement to run LemmInflect is numpy, which is used for the matrix math that drives the neural nets. These nets are relatively small and don't require significant CPU power to run.

To install, run:

pip3 install lemminflect

The project was built and tested under Python 3 on Ubuntu but should run on any Linux, Windows, or macOS system. It is untested under Python 2 but may function in that environment with minimal or no changes.

The code base also includes library functions and scripts to create the various data files and neural nets. These include:

  • Unigram extraction from the Gutenberg and Billion Word corpora
  • Python scripts for loading and parsing the SPECIALIST Lexicon
  • Neural network training based on Keras and TensorFlow

None of these are required for run-time operation. However, if you want to modify the system, see the documentation for more info.

Library Usage

To lemmatize a word, use the method getLemma(). This takes a word and a Universal Dependencies tag and returns the lemmas as a tuple of possible spellings. The dictionary system is used first, and if no lemma is found, the rules system is employed.

> from lemminflect import getLemma
> getLemma('watches', upos='VERB')
('watch',)

To inflect words, use the method getInflection(). This takes a lemma and a Penn Treebank tag and returns a tuple of the specific inflection(s) associated with that tag. As above, the dictionary is used first, and then inflection rules are applied if needed.

> from lemminflect import getInflection
> getInflection('watch', tag='VBD')
('watched',)

> getInflection('xxwatch', tag='VBD')
('xxwatched',)

The library provides lower-level functions to access the dictionary and the OOV rules directly. For a detailed description see Lemmatizer or Inflections.

Usage as a spaCy Extension

To use as an extension, you need spaCy version 2.0 or later. Versions 1.9 and earlier do not support the extension methods used here.

To set up the extension, first import lemminflect. This will create new lemma and inflect methods for each spaCy Token. The methods operate similarly to the methods described above, except that they return a string containing the most common spelling rather than a tuple.

> import spacy
> import lemminflect
> nlp = spacy.load('en_core_web_sm')
> doc = nlp('I am testing this example.')
> doc[2]._.lemma()
test

> doc[4]._.inflect('NNS')
examples
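
Reducing a tuple of candidate spellings to the single most common one could look roughly like this. The sketch assumes a plain word-count table; `TOY_UNIGRAM_COUNTS` and `most_common_spelling` are hypothetical names with made-up counts, not part of the LemmInflect API.

```python
from typing import Dict, Tuple

# Hypothetical unigram counts; real counts would come from a large corpus.
TOY_UNIGRAM_COUNTS: Dict[str, int] = {
    "arthroplasties": 120,
    "arthroplastys": 3,
}


def most_common_spelling(forms: Tuple[str, ...], counts: Dict[str, int]) -> str:
    """Return the candidate with the highest count; ties keep the earliest one."""
    if not forms:
        raise ValueError("no candidate forms given")
    # Words missing from the table are treated as having a count of zero.
    return max(forms, key=lambda form: counts.get(form, 0))


candidates = ("arthroplastys", "arthroplastyes", "arthroplasties")
print(most_common_spelling(candidates, TOY_UNIGRAM_COUNTS))
```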

Issues

If you find a bug, please report it on the GitHub issues list. However, be aware that when it comes to returning the correct inflection, a number of different types of issues can arise, and some of these are not readily fixable. Issues with inflected forms include:

  • Multiple spellings for an inflection (e.g., arthroplasties, arthroplastyes or arthroplastys)
  • Mass form and plural types (e.g., people vs. peoples)
  • Forms that depend on context (e.g., further vs. farther)
  • Inflections that are not fully specified by the tag (e.g., be/VBD can be "was" or "were")

One common issue is that some forms of the verb "be" are not completely specified by the treebank tag. For instance, be/VBD inflects to either "was" or "were", and be/VBP inflects to either "am" or "are". To disambiguate these forms, other words in the sentence need to be inspected. At this time, LemmInflect doesn't include this functionality.
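
A caller who needs a single form of "be" could apply a small context rule on top of the returned candidates. The sketch below is a hypothetical helper, not part of LemmInflect, and it assumes the subject is already known and is a personal pronoun; noun subjects would need number information from a parser.

```python
from typing import Dict, Tuple

# Candidate forms for the two ambiguous tags discussed above.
BE_FORMS: Dict[str, Tuple[str, str]] = {
    "VBD": ("was", "were"),
    "VBP": ("am", "are"),
}


def pick_be_form(tag: str, subject: str) -> str:
    """Pick a form of "be" from a pronoun subject (hypothetical helper)."""
    if tag not in BE_FORMS:
        raise ValueError(f"unsupported tag: {tag}")
    first, second = BE_FORMS[tag]
    subject = subject.lower()
    if subject in {"you", "we", "they"}:
        return second  # "were" / "are"
    if subject == "i":
        return first   # "was" / "am"
    if subject in {"he", "she", "it"} and tag == "VBD":
        return first   # "was"; in the present tense these subjects take VBZ
    raise ValueError(f"cannot choose a form of 'be' for {subject!r} with {tag}")


print(pick_be_form("VBD", "They"))
print(pick_be_form("VBP", "I"))
```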
