adbar / simplemma

Licence: MIT license
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency


Simplemma: a simple multilingual lemmatizer for Python


Purpose

Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. Unlike stemming, lemmatization outputs word units that are still valid linguistic forms.

In modern natural language processing (NLP), this task is often tackled indirectly by more complex systems that encompass a whole processing pipeline. However, there appears to be no straightforward way to address lemmatization on its own in Python, even though the task is useful in information retrieval and NLP.

Simplemma provides a simple and multilingual approach to look for base forms or lemmata. It may not be as powerful as full-fledged solutions but it is generic, easy to install and straightforward to use. In particular, it doesn't need morphosyntactic information and can process a raw series of tokens or even a text with its built-in (simple) tokenizer. By design it should be reasonably fast and work in a large majority of cases, without being perfect.

With its comparatively small footprint it is especially useful when speed and simplicity matter, in low-resource contexts, for educational purposes, or as a baseline system for lemmatization and morphological analysis.
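The core idea behind such a lightweight lemmatizer can be pictured as a lookup in a precompiled table mapping inflected forms to lemmata. The following is a minimal sketch in pure Python with toy data; it is not simplemma's actual API or data format:

```python
# Toy mapping from inflected forms to lemmata (illustrative only,
# not simplemma's internal data format).
LEMMA_TABLE = {
    "masks": "mask",
    "mice": "mouse",
    "ran": "run",
}

def lookup_lemma(token: str) -> str:
    """Return the lemma if the form is known, otherwise the token unchanged."""
    return LEMMA_TABLE.get(token, token)

print(lookup_lemma("masks"))  # mask
print(lookup_lemma("table"))  # table (unknown form, left unchanged)
```

Leaving unknown forms unchanged is what keeps such an approach robust as a baseline: it never degrades input it cannot analyze.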

Currently, 38 languages are partly or fully supported (see table below).

Installation

The current library is written in pure Python with no dependencies:

pip install simplemma

  • pip3 where applicable
  • pip install -U simplemma for updates

Usage

Word-by-word

Simplemma is used by selecting a language of interest and then applying its data to a list of words.

>>> import simplemma
# get a word
>>> myword = 'masks'
# decide which language data to load
>>> langdata = simplemma.load_data('en')
# apply it on a word form
>>> simplemma.lemmatize(myword, langdata)
'mask'
# grab a list of tokens
>>> mytokens = ['Hier', 'sind', 'Vaccines']
>>> langdata = simplemma.load_data('de')
>>> for token in mytokens:
...     simplemma.lemmatize(token, langdata)
'hier'
'sein'
'Vaccines'
# list comprehensions can be faster
>>> [simplemma.lemmatize(t, langdata) for t in mytokens]
['hier', 'sein', 'Vaccines']

Chaining several languages can improve coverage:

>>> langdata = simplemma.load_data('de', 'en')
>>> simplemma.lemmatize('Vaccines', langdata)
'vaccine'
>>> langdata = simplemma.load_data('it')
>>> simplemma.lemmatize('spaghettis', langdata)
'spaghettis'
>>> langdata = simplemma.load_data('it', 'fr')
>>> simplemma.lemmatize('spaghettis', langdata)
'spaghetti'
>>> simplemma.lemmatize('spaghetti', langdata)
'spaghetto'
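The chaining behaviour above amounts to a first-match lookup across per-language tables, probed in the order given. A sketch with toy data (the real data files are far larger):

```python
# Toy per-language lemma tables (illustrative only).
TABLES = {
    "it": {"spaghetti": "spaghetto"},
    "fr": {"spaghettis": "spaghetti"},
}

def chained_lemmatize(token, langs):
    """Return the first lemma found when probing the languages in order."""
    for lang in langs:
        lemma = TABLES[lang].get(token)
        if lemma is not None:
            return lemma
    return token  # no table knows the form: fall back to the input

print(chained_lemmatize("spaghettis", ["it", "fr"]))  # spaghetti (via 'fr')
print(chained_lemmatize("spaghetti", ["it", "fr"]))   # spaghetto (via 'it')
```

Since lookup stops at the first match, the order of languages passed to load_data() matters for ambiguous forms.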

There are cases in which a greedier decomposition and lemmatization algorithm is better. It is deactivated by default:

# same example as before, comes to this result in one step
>>> simplemma.lemmatize('spaghettis', langdata, greedy=True)
'spaghetto'
# a German case
>>> langdata = simplemma.load_data('de')
>>> simplemma.lemmatize('angekündigten', langdata, greedy=True)
'ankündigen' # infinitive verb
>>> simplemma.lemmatize('angekündigten', langdata, greedy=False)
'angekündigt' # past participle

Tokenization

A simple tokenization function is included for convenience:

>>> from simplemma import simple_tokenizer
>>> simple_tokenizer('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.')
['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', ',', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', '.']
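A comparable tokenizer can be written in a few lines with a regular expression that keeps word characters together and splits off punctuation. This is a simplified stand-in, not the exact rules simple_tokenizer applies:

```python
import re

# Match runs of word characters, or any single non-space, non-word
# character (punctuation) as its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)

def naive_tokenizer(text: str) -> list:
    """Split a text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(naive_tokenizer("Lorem ipsum dolor sit amet, consectetur."))
# ['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', '.']
```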

The function text_lemmatizer() chains tokenization and lemmatization. It can take greedy (affecting lemmatization) and silent (affecting errors and logging) as arguments:

>>> from simplemma import text_lemmatizer
>>> langdata = simplemma.load_data('pt')
>>> text_lemmatizer('Sou o intervalo entre o que desejo ser e os outros me fizeram.', langdata)
# caveat: desejo is also a noun, should be desejar here
['ser', 'o', 'intervalo', 'entre', 'o', 'que', 'desejo', 'ser', 'e', 'o', 'outro', 'me', 'fazer', '.']
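Conceptually, text_lemmatizer() is just the composition of tokenization and per-token lemmatization, as in this pure-Python sketch with toy data (not the actual implementation):

```python
import re

TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)
# Toy lemma table for illustration only.
LEMMAS = {"fizeram": "fazer", "outros": "outro"}

def toy_text_lemmatizer(text):
    """Tokenize, then lemmatize each token, falling back to the token itself."""
    return [LEMMAS.get(tok, tok) for tok in TOKEN_RE.findall(text)]

print(toy_text_lemmatizer("os outros me fizeram."))
# ['os', 'outro', 'me', 'fazer', '.']
```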

Caveats

# don't expect too much though
>>> langdata = simplemma.load_data('it')
# this diminutive form isn't in the model data
>>> simplemma.lemmatize('spaghettini', langdata)
'spaghettini' # should read 'spaghettino'
# the algorithm cannot choose between valid alternatives yet
>>> langdata = simplemma.load_data('es')
>>> simplemma.lemmatize('son', langdata)
'son' # valid common name, but what about the verb form?

As the focus lies on overall coverage, some short frequent words (typically pronouns and conjunctions) may need post-processing; this generally concerns a few dozen tokens per language.

Additionally, the current absence of morphosyntactic information is both an advantage in terms of simplicity and an impassable frontier with respect to lemmatization accuracy, e.g. when disambiguating between past participles and adjectives derived from verbs in Germanic and Romance languages. In such cases, simplemma often leaves the input unchanged.

The greedy algorithm rarely produces forms that are not valid. It is designed to work best in the low-frequency range, notably for compound words and neologisms. Aggressive decomposition is only useful as a general approach in the case of morphologically-rich languages. It can also act as a linguistically motivated stemmer.
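One way to picture the greedy strategy is as an iterated lookup: keep applying the lemma table until a fixed point is reached, so chains like 'spaghettis' → 'spaghetti' → 'spaghetto' resolve in one call. A deliberately naive sketch with toy data, not simplemma's actual algorithm:

```python
# Toy lemma table: note the chain spaghettis -> spaghetti -> spaghetto.
LEMMAS = {"spaghettis": "spaghetti", "spaghetti": "spaghetto"}

def greedy_lemmatize(token, max_steps=3):
    """Apply the lemma table repeatedly until a fixed point (or step limit)."""
    for _ in range(max_steps):
        lemma = LEMMAS.get(token)
        if lemma is None or lemma == token:
            break
        token = lemma
    return token

print(greedy_lemmatize("spaghettis"))  # spaghetto (two lookup steps)
```

The step limit guards against cycles in the data; the real greedy mode additionally tries affix stripping and compound decomposition, which this sketch omits.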

Bug reports over the issues page are welcome.

Supported languages

The following languages are available using their ISO 639-1 code:

Available languages (2022-04-06)
Code Language Words (10³) Acc. Comments
bg Bulgarian 213    
ca Catalan 579    
cs Czech 187 0.88 on UD CS-PDT
cy Welsh 360    
da Danish 554 0.92 on UD DA-DDT, alternative: lemmy
de German 682 0.95 on UD DE-GSD, see also German-NLP list
el Greek 183 0.88 on UD EL-GDT
en English 136 0.94 on UD EN-GUM, alternative: LemmInflect
es Spanish 720 0.94 on UD ES-GSD
et Estonian 133   low coverage
fa Persian 10   low coverage, potential issues
fi Finnish 2,106   alternatives: voikko or NLP list
fr French 217 0.94 on UD FR-GSD
ga Irish 383    
gd Gaelic 48    
gl Galician 384    
gv Manx 62    
hu Hungarian 458    
hy Armenian 323    
id Indonesian 17 0.91 on UD ID-CSUI
it Italian 333 0.93 on UD IT-ISDT
ka Georgian 65    
la Latin 850    
lb Luxembourgish 305    
lt Lithuanian 247    
lv Latvian 168    
mk Macedonian 57    
nb Norwegian (Bokmål) 617    
nl Dutch 254 0.91 on UD-NL-Alpino
pl Polish 3,733 0.91 on UD-PL-PDB
pt Portuguese 933 0.92 on UD-PT-GSD
ro Romanian 311    
ru Russian 607   alternative: pymorphy2
sk Slovak 846 0.92 on UD SK-SNK
sl Slovenian 97   low coverage
sv Swedish 658   alternative: lemmy
tr Turkish 1,333 0.88 on UD-TR-Boun
uk Ukrainian 190   alternative: pymorphy2

A "low coverage" mention means one would probably be better off with a language-specific library, but simplemma will still work to a limited extent. Open-source alternatives for Python are referenced where possible.

The scores are calculated on Universal Dependencies treebanks on single-word tokens (including some contractions but not merged prepositions); they describe to what extent simplemma can accurately map tokens to their lemma form. They can be reproduced using the script udscore.py in the tests/ folder.
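The reported accuracies boil down to an exact-match comparison between predicted and gold lemmata over token/lemma pairs, along these lines (a schematic illustration with toy data, not the actual evaluation script):

```python
def lemma_accuracy(pairs, lemmatize):
    """Share of tokens whose predicted lemma equals the gold lemma."""
    hits = sum(1 for token, gold_lemma in pairs if lemmatize(token) == gold_lemma)
    return hits / len(pairs)

# Toy gold data and a trivial lookup lemmatizer for illustration.
gold = [("masks", "mask"), ("mask", "mask"), ("ran", "run"), ("cats", "cat")]
toy = {"masks": "mask", "cats": "cat"}.get
print(lemma_accuracy(gold, lambda t: toy(t, t)))  # 0.75
```

Note how the unchanged-input fallback still scores on tokens that already are lemmata ("mask" above), which partly explains why a lookup approach fares well on running text.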

This library is particularly relevant as regards the lemmatization of less frequent words. Its performance in this case is only incidentally captured by the benchmark above.

Roadmap

  • [-] Add further lemmatization lists
  • [ ] Grammatical categories as option
  • [ ] Function as a meta-package?
  • [ ] Integrate optional, more complex models?

Credits

The software is under MIT license; for the linguistic information databases, see the licenses folder.

The surface lookups (non-greedy mode) use lemmatization lists taken from various sources:

This rule-based approach, grounded in inflection and lemmatization dictionaries, is to this day used in popular libraries such as spaCy.

Contributions

Feel free to contribute, notably by filing issues for feedback, bug reports, or links to further lemmatization lists, rules and tests.

You can also contribute to this lemmatization list repository.

Other solutions

See lists: German-NLP and other awesome-NLP lists.

For a more complex and universal approach in Python see universal-lemmatizer.

References

Barbaresi A. (2021). Simplemma: a simple multilingual lemmatizer for Python. Zenodo. http://doi.org/10.5281/zenodo.4673264

This work draws from lexical analysis algorithms used in:
