smc / mlmorph

Licence: MIT license
Malayalam Morphological Analyzer using Finite State Transducer

Programming Languages: Lex, F*, python, Makefile, sed, shell

mlmorph

Introduction

mlmorph is a Malayalam morphological analyser and generator. It aims to build a morphological model of the Malayalam language using finite state transducer technology. Specifically, the system is developed using the Stuttgart Finite State Toolkit (SFST) formalism.

Malayalam is a heavily inflected and agglutinative language, and this project attempts to iteratively develop a morphological model for it.

For a detailed introduction and explanation of the approach, please refer to the blog post https://thottingal.in/blog/2017/11/26/towards-a-malayalam-morphology-analyser/

Status

Currently the analyser can parse (or recognize) more than 80% of the words in our test corpora of 50,000 Malayalam words. The lexicon is being updated and expanded to include more commonly used words. Morpho-phonological rules are still being added, although the common constructs are already covered.
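The coverage figure above is the fraction of corpus words for which the analyser returns at least one analysis. A minimal sketch of that calculation follows. It assumes an analyser callable that returns an empty result for unrecognised words; `coverage`, `KNOWN` and `toy_analyse` are hypothetical names used for illustration and are not part of mlmorph.

```python
from typing import Callable, Iterable, Sequence


def coverage(words: Iterable[str], analyse: Callable[[str], Sequence]) -> float:
    """Return the fraction of words that get at least one analysis."""
    total = 0
    recognised = 0
    for word in words:
        total += 1
        if analyse(word):  # an empty result counts as "not recognised"
            recognised += 1
    return recognised / total if total else 0.0


# Toy stand-in for a real analyser (hypothetical data, for illustration only).
KNOWN = {"കേരളത്തിന്റെ": [("കേരളം<np><genitive>", 179)]}


def toy_analyse(word: str) -> Sequence:
    return KNOWN.get(word, [])


print(coverage(list(KNOWN) + ["xyz"], toy_analyse))  # 0.5
```

With mlmorph installed, the `analyse` method of an `Analyser` instance could be passed in place of `toy_analyse`, provided it returns an empty result for words it cannot parse (an assumption worth checking first).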

Installation and usage

The easiest way to start using mlmorph is to install the Python library, which comes with the compiled automata.

pip install mlmorph

For API documentation and command line usage, see https://pypi.org/project/mlmorph

Morphological analysis example

from mlmorph import Analyser
analyser = Analyser()
analyser.analyse("കേരളത്തിന്റെ")

Gives [('കേരളം<np><genitive>', 179)]

The second item in each result is the weight. A single word can sometimes have multiple analyses; the analysis with the lowest weight is the preferred one.
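Picking the preferred analysis from such a result can be sketched as follows. It assumes the result is a list of `(analysis, weight)` pairs, as in the example above. `preferred_analysis` is a hypothetical helper, and the second entry in `sample` is made up to show the selection.

```python
from typing import Optional, Sequence, Tuple

Analysis = Tuple[str, float]


def preferred_analysis(analyses: Sequence[Analysis]) -> Optional[Analysis]:
    """Return the (analysis, weight) pair with the lowest weight, or None if empty."""
    return min(analyses, key=lambda pair: pair[1], default=None)


sample = [
    ("കേരളം<np><genitive>", 179),   # result shown in the example above
    ("made-up-analysis<n>", 250),   # invented entry, for illustration only
]

print(preferred_analysis(sample))
```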

Morphological generator example

from mlmorph import Generator
generator = Generator()
generator.generate("കേരളം<np><genitive>")

Gives (('കേരളത്തിന്റെ', 0.0),)
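The analysis strings used by the analyser and generator above consist of text followed by tags in angle brackets. A minimal sketch for splitting such a string into its parts and joining it back is shown below. It assumes the `text<tag><tag>` shape seen in these examples, possibly repeated; `parse_analysis` and `build_analysis` are hypothetical helpers and not part of the mlmorph API.

```python
import re
from typing import List, Tuple

# One segment: some text followed by one or more <tag> markers.
SEGMENT = re.compile(r"([^<>]+)((?:<[^<>]+>)+)")
TAG = re.compile(r"<([^<>]+)>")

Segment = Tuple[str, List[str]]


def parse_analysis(analysis: str) -> List[Segment]:
    """Split 'text<tag1><tag2>...' into [(text, [tag1, tag2]), ...]."""
    return [(text, TAG.findall(tags)) for text, tags in SEGMENT.findall(analysis)]


def build_analysis(segments: List[Segment]) -> str:
    """Inverse of parse_analysis: join text and tags back into one string."""
    return "".join(
        text + "".join(f"<{tag}>" for tag in tags) for text, tags in segments
    )


parts = parse_analysis("കേരളം<np><genitive>")
print(parts)
print(build_analysis(parts))
```

A string built this way has the same shape as the input passed to `generator.generate` in the example above.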

Command line interface

$ python -m mlmorph --help
usage: __main__.py [-h] [-i INFILE] [-a] [-g] [-v]

optional arguments:
    -h, --help            show this help message and exit
    -i INFILE, --input INFILE
                        source of analysis data
    -a, --analyse         Analyse the input file strings
    -g, --generate        Generate the input file strings
    -v, --verbose         print verbosely while processing

Accepts strings from stdin too. For example:

$ python -m mlmorph -a
നിറങ്ങൾ
നിറങ്ങൾ   നിറം<n><pl>

Applications

Spellchecker

A spellchecker based on this analyser is being developed. See https://gitlab.com/smc/mlmorph-spellchecker. You can try out an online version at morph.smc.org.in/spellcheck

A LibreOffice extension that uses the spellchecker is also being developed. See https://gitlab.com/smc/mlmorph-libreoffice-spellchecker

Analysing numbers

The textual form of Malayalam numbers has an interesting characteristic: a limited vocabulary produces an unbounded number of words through agglutination of number parts. A number like 12345 is written as പന്ത്രണ്ടായിരത്തിമുന്നൂറ്റിനാൽപത്തഞ്ച്. This is composed from 12 = പന്ത്രണ്ട്, 1000 = ആയിരം, 300 = മുന്നൂറ്, 40 = നാല്പത്, 5 = അഞ്ച്. Agglutination happens at 5 places in this word. When agglutination happens, the morpheme boundaries change on the left side, the right side, or both. The number module of the mlmorph analyser aims to analyse and generate any arbitrary number in its text form.

For more details and a demo, please refer to https://thottingal.in/blog/2017/12/10/number-spellout-and-generation-in-malayalam-using-morphology-analyser/
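The arithmetic behind the breakdown of 12345 above can be sketched as follows. This only illustrates how a number splits into the kind of parts listed above (a thousands multiplier, 1000, hundreds, tens, ones). It is not how mlmorph implements its number module, it produces no Malayalam text, and it is limited to 1 to 99999. `number_parts` is a hypothetical helper.

```python
from typing import List


def number_parts(n: int) -> List[int]:
    """Split n into a thousands multiplier, 1000, hundreds, tens and ones."""
    if not 0 < n < 100_000:
        raise ValueError("this sketch only handles 1..99999")
    parts: List[int] = []
    thousands, rest = divmod(n, 1000)
    if thousands:
        # The multiplier is kept as one number, like 12 in the example above.
        parts.extend([thousands, 1000])
    hundreds, rest = divmod(rest, 100)
    if hundreds:
        parts.append(hundreds * 100)
    tens, ones = divmod(rest, 10)
    if tens:
        parts.append(tens * 10)
    if ones:
        parts.append(ones)
    return parts


print(number_parts(12345))  # [12, 1000, 300, 40, 5]
```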

Named Entity Recognition

Named Entity Recognition (NER), the task of identifying and classifying real-world objects such as persons, places and organizations in a given text, is a well-known NLP problem. Since mlmorph already provides POS tags and morphological analysis, little extra work is needed for NER: we just need to look for the tags corresponding to proper nouns and report them. You can try the system at https://morph.smc.org.in/ner

Detailed documentation: https://thottingal.in/blog/2019/03/10/malayalam-named-entity-recognition-using-morphology-analyser/
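The tag lookup described in this section can be sketched as a simple filter. It assumes that `<np>` marks a proper noun, as in the analysis example earlier in this README; the full tag set may include other relevant tags. `proper_nouns` is a hypothetical helper, and the input pairs are taken from the examples in this document.

```python
from typing import Iterable, List, Tuple


def proper_nouns(analysed: Iterable[Tuple[str, str]], tag: str = "np") -> List[str]:
    """Return the words whose analysis string contains the given tag."""
    marker = f"<{tag}>"
    return [word for word, analysis in analysed if marker in analysis]


# (word, analysis) pairs taken from the examples in this README.
analysed_words = [
    ("കേരളത്തിന്റെ", "കേരളം<np><genitive>"),
    ("നിറങ്ങൾ", "നിറം<n><pl>"),
]

print(proper_nouns(analysed_words))
```

Only the first word is reported, since the second analysis carries `<n>` rather than `<np>`.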

For Developers

You need the Stuttgart Finite State Toolkit (SFST) to compile and use this analyzer.

The Makefile provided compiles all the sources and produces the binary FSA 'malayalam.a'. Running 'make' should be enough to get started.

On a Debian/Ubuntu-based GNU/Linux system, SFST can be installed as follows:

$ sudo apt install sfst

Clone or download this git repository to your machine.

Build the FST by

$ make

This will create a file named malayalam.a, which is the compiled generator. Individual modules are also created, such as num.a, which is the number generator.

Tests

The analyser is being developed with a large number of tests. To run the tests:

$ make test

Citation

Please cite the following publication when referring to mlmorph:

@inproceedings{thottingal-2019-finite,
    title = "Finite State Transducer based Morphology analysis for {M}alayalam Language",
    author = "Thottingal, Santhosh",
    booktitle = "Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages",
    month = "20 " # aug,
    year = "2019",
    address = "Dublin, Ireland",
    publisher = "European Association for Machine Translation",
    url = "https://www.aclweb.org/anthology/W19-6801",
    pages = "1--5",
}

License

mlmorph is under MIT license.
