kwrobel-nlp / krnnt

Licence: LGPL-3.0
Polish morphological tagger.


KRNNT

KRNNT is a morphological tagger for Polish based on recurrent neural networks.

Try KRNNT online

It was presented at the 8th Language & Technology Conference. More details are available in the paper:

@inproceedings{wrobel2017,
  author       = "Wróbel, Krzysztof",
  editor       = "Vetulani, Zygmunt and Paroubek, Patrick",
  title        = "KRNNT: Polish Recurrent Neural Network Tagger",
  year         = "2017",
  booktitle    = "Proceedings of the 8th Language \& Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics",
  publisher    = "Fundacja Uniwersytetu im. Adama Mickiewicza w~Poznaniu",
  pages        = {386-391},
  pdf          = "http://ltc.amu.edu.pl/book2017/papers/PolEval1-6.pdf"
}

The paper is available online at: http://ltc.amu.edu.pl/book2017/papers/PolEval1-6.pdf

A copy is also available at: https://www.researchgate.net/publication/333566748_KRNNT_Polish_Recurrent_Neural_Network_Tagger

External tools

The tagger uses two external tools: the tokenizer Toki and the morphological analyzer Morfeusz. Maca (http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki) integrates both.

The tagset is described here: http://nkjp.pl/poliqarp/help/ense2.html

Getting started

You can run KRNNT using Docker or by installing it manually.

Docker

The Docker image was prepared by Aleksander Smywiński-Pohl, and instructions are available at: https://hub.docker.com/r/djstrong/krnnt/

  1. Download and start the server.
docker run -p 9003:9003 -it djstrong/krnnt:1.0.0
  2. Tag a text using a POST request, or open http://localhost:9003 in a browser.
$ curl -XPOST localhost:9003 -d "Ala ma kota."
Ala    none
    Ala    subst:sg:nom:f    disamb
ma    space
    mieć    fin:sg:ter:imperf    disamb
kota    space
    kot    subst:sg:acc:m2    disamb
.    none
    .    interp    disamb
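The plain output above is easy to consume programmatically. A minimal parser sketch, assuming token lines are unindented ("form, separator") and analysis lines are indented ("lemma, tag, optional disamb"), with fields separated by tabs or runs of spaces as in the example:

```python
def parse_plain(output: str):
    """Parse KRNNT plain output into a list of token dicts.

    Assumes forms contain no whitespace, so naive splitting is safe.
    """
    tokens = []
    for line in output.splitlines():
        if not line.strip():
            continue  # empty lines separate sentences/documents
        if line[0] not in " \t":
            # token line: form and separator (none/space/newline)
            form, sep = line.split()
            tokens.append({"form": form, "separator": sep, "analyses": []})
        else:
            # analysis line belonging to the last token
            fields = line.split()
            lemma, tag = fields[0], fields[1]
            disamb = "disamb" in fields[2:]
            tokens[-1]["analyses"].append((lemma, tag, disamb))
    return tokens

sample = "Ala\tnone\n\tAla\tsubst:sg:nom:f\tdisamb\nma\tspace\n\tmie\u0107\tfin:sg:ter:imperf\tdisamb\n"
tokens = parse_plain(sample)
```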

Manual installation

Please refer to the Dockerfile: https://github.com/kwrobel-nlp/dockerfiles/blob/morfeusz2/tagger/Dockerfile

  1. Install Maca: http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki

Make sure that the command maca-analyse works:

echo "Ala ma kota." | maca-analyse -qc morfeusz2-nkjp
Ala	newline
	Al	subst:sg:gen:m1
	Al	subst:sg:acc:m1
	Ala	subst:sg:nom:f
	Alo	subst:sg:gen:m1
	Alo	subst:sg:acc:m1
ma	space
	mieć	fin:sg:ter:imperf
	mój	adj:sg:nom:f:pos
	mój	adj:sg:voc:f:pos
kota	space
	Kot	subst:sg:gen:m1
	Kot	subst:sg:acc:m1
	kot	subst:sg:gen:m1
	kot	subst:sg:acc:m1
	kot	subst:sg:gen:m2
	kot	subst:sg:acc:m2
	kota	subst:sg:nom:f
.	none
	.	interp
  2. Clone the KRNNT repository:
git clone https://github.com/kwrobel-nlp/krnnt.git
  3. Install dependencies.
pip3 install -e .

Evaluation

Accuracy was tested with 10-fold cross-validation on the National Corpus of Polish.

Accuracy lower bound: 93.72%
Accuracy lower bound for unknown tokens: 69.03%

PolEval

The tagger participated in the PolEval 2017 competition: http://2017.poleval.pl/index.php/results/

The original submission was created using the poleval tag.

Training

  1. Install KRNNT.
krnnt]$ pip3 install -e .
  2. Prepare training data.
krnnt]$ time process_xces.py train-gold.xml train-gold.spickle 
0:14.75
  3. Reanalyze the corpus with Maca.
krnnt]$ python3 reanalyze.py train-gold.spickle train-reanalyzed.spickle
Number of sentences by Maca vs gold 9 10
Number of sentences by Maca vs gold 7 8
Number of sentences by Maca vs gold 7 6
Number of sentences by Maca vs gold 8 10
Number of sentences by Maca vs gold 3 3
Number of sentences by Maca vs gold 3 3
Number of sentences by Maca vs gold 7 7
Number of sentences by Maca vs gold 4 4
...
1:30.30

Ensure that the last two numbers in each row are usually the same; zeros indicate problems with Maca.

  4. Shuffle data (optional).
krnnt]$ time python3 shuffle.py train-reanalyzed.spickle train-reanalyzed.shuf.spickle
0:28.95
  5. Preprocess data.
krnnt]$ time python3 preprocess_data.py train-reanalyzed.shuf.spickle train-reanalyzed.shuf.spickle.preprocessed
0:56.80
  6. Create a dictionary for all features.
krnnt]$ time python3 create_dict.py train-reanalyzed.shuf.spickle.preprocessed train-reanalyzed.shuf.spickle.dict
0:18.93
  7. Train the lemmatization module.
krnnt]$ time python3 train_lemmatization.py train-reanalyzed.shuf.spickle.preprocessed --hash model_nkjp
0:09.55
  8. Train for 150 epochs. Add -d 0.1 to use 10% of the training data as a development set.
krnnt]$ python3 train.py train-reanalyzed.shuf.spickle.preprocessed train-reanalyzed.shuf.spickle.dict -e 150 --patience 150 --hash model_nkjp --test_data poleval-reanalyzed.shuf.spickle.preprocessed

Check ~/.keras/keras.json for "image_data_format": "channels_first".

  9. Testing.
krnnt]$ time python3 krnnt_run.py weight_model_nkjp.hdf5.final lemmatisation_model_nkjp.pkl train-reanalyzed.shuf.spickle.dict < test-raw.txt > test-raw.xml
0:09.02
  10. Evaluate.
krnnt]$ python2 tagger-eval.py gold-task-c.xml test-raw.xml -t poleval -s
PolEval 2017 competition scores
-------------------------------
POS accuracy (Subtask A score): 	92.3308%
POS accuracy (known words): 	92.3308%
POS accuracy (unknown words): 	0.0000%
Lemmatization accuracy (Subtask B score): 	96.8816%
Lemmatization accuracy (known words): 	96.8816%
Lemmatization accuracy (unknown words): 	0.0000%
Overall accuracy (Subtask C score): 	94.6062%

Training on gold segmentation

  1. Prepare training data.
krnnt]$ time python3 process_xces.py train-analyzed.xml train-analyzed.spickle
real	0m37.211s

krnnt]$ time python3 process_xces.py train-gold.xml train-gold.spickle
real	0m14.750s

krnnt]$ time python3 merge_analyzed_gold.py train-gold.spickle train-analyzed.spickle train-merged.spickle
real	0m18.215s
  2. Shuffle data (optional).
krnnt]$ time python3 shuffle.py train-merged.spickle train-merged.shuf.spickle
real	0m21,636s
  3. Preprocess data.
time python3 preprocess_data.py -p train-merged.shuf.spickle train-merged.shuf.spickle.preprocessed
real	0m52,872s
  4. Create a dictionary for all features.
krnnt]$ time python3 create_dict.py train-merged.shuf.spickle.preprocessed train-merged.shuf.spickle.dict
real	0m19,756s
  5. Train the lemmatization module.
krnnt]$ time python3 train_lemmatization.py train-merged.shuf.spickle.preprocessed --hash model_nkjp_pre
real	0m7,184s
  6. Train for 150 epochs.
krnnt]$ python3 train.py train-merged.shuf.spickle.preprocessed train-merged.shuf.spickle.dict -e 150 --patience 150 --hash model_nkjp_pre --test_data poleval-reanalyzed.shuf.spickle.preprocessed

  7. Testing.
krnnt]$ time python3 krnnt_run.py -p weight_model_nkjp_pre.hdf5.final lemmatisation_model_nkjp_pre.pkl train-merged.shuf.spickle.dict < test-analyzed.xml > test-analyzed.xml.pred
real	0m8,426s
  8. Evaluate.
krnnt]$ python2 tagger-eval.py gold-task-a-b.xml test-analyzed.xml.pred -t poleval -s
PolEval 2017 competition scores
-------------------------------
POS accuracy (Subtask A score): 	93.9106%
POS accuracy (known words): 	93.9106%
POS accuracy (unknown words): 	0.0000%
Lemmatization accuracy (Subtask B score): 	97.8654%
Lemmatization accuracy (known words): 	97.8654%
Lemmatization accuracy (unknown words): 	0.0000%
Overall accuracy (Subtask C score): 	95.8880%

Testing

Trained models are available with releases: https://github.com/kwrobel-nlp/krnnt/releases

krnnt]$ pip3 install -e .

reana.zip contains model trained with reanalyzed data:

krnnt]$ python3 krnnt_run.py reana/weights_reana.hdf5 reana/lemmatisation_reana.pkl reana/dictionary_reana.pkl < test-raw.txt > test-raw.krnnt.xml

preana.zip contains model trained with preanalyzed data:

krnnt]$ python3 krnnt_run.py -p preana/weights_preana.hdf5 preana/lemmatisation_preana.pkl preana/dictionary_preana.pkl < test-analyzed.xml > test-analyzed.krnnt.xml

Voting

Training more models and performing simple voting increases accuracy. Voting over 10 models achieves an accuracy lower bound of about 94.30%.

reana10.zip and preana10.zip contain 10 models each.

for i in {0..9}
do
   python3 krnnt_run.py reana/$i.hdf5 reana/lemmatisation.pkl reana/dictionary.pkl < test-raw.txt > reana/$i.xml
done
python3 voting.py reana/ > reana/test-raw.krnnt.voting.xml
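The idea behind voting.py can be sketched as per-token majority voting over the tag sequences produced by the individual models. The following is a hypothetical illustration only (the actual script works on the XML outputs and also merges lemmas):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one tag sequence per model, all of equal length.
    Returns the per-token majority tag (ties broken by first occurrence)."""
    return [Counter(tags).most_common(1)[0][0] for tags in zip(*predictions)]

# three hypothetical model outputs for a four-token sentence
model_a = ["subst:sg:nom:f", "fin:sg:ter:imperf", "subst:sg:acc:m2", "interp"]
model_b = ["subst:sg:nom:f", "fin:sg:ter:imperf", "subst:sg:gen:m2", "interp"]
model_c = ["subst:sg:nom:f", "adj:sg:nom:f:pos", "subst:sg:acc:m2", "interp"]
voted = majority_vote([model_a, model_b, model_c])
```

Each model gets one vote per token, so an error made by a single model is outvoted as long as the other models agree.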

KRNNT is licensed under GNU LGPL v3.0.

Input formats

  • text
    • raw format (default) - the whole input is one document (e.g. a Wikipedia article); use it when token offsets are needed
    • lines format - documents are separated by an empty line
  • pretokenized text
    • JSON
      • verbose
      • compact

The input format is detected automatically, except for the lines format.

Text

Raw format

Lines format

?input_format=lines

Pretokenized text

The format consists of documents. Each document has sentences, and each sentence has tokens.

Fields for each token:

  • form
  • separator (optional) - whitespace characters before the token: newline, space or none
  • start (optional) - starting offset of the token
  • end (optional) - ending offset of the token

The separator is a feature for the classifier. If separator is not provided but the start and end positions are, the separator is computed from the offsets. If none of the separator, start and end fields are provided, the separator is set to True.
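The rule above can be sketched as follows. This is a minimal illustration, not the library's actual code; a token is assumed to be preceded by whitespace when there is a gap between the previous token's end offset and its start offset:

```python
def fill_separators(tokens):
    """tokens: list of dicts with 'form' and optional 'separator', 'start', 'end'.
    Fills in 'separator': keep it if given; compute it from offsets when
    start/end are present; otherwise default to True."""
    prev_end = None
    for tok in tokens:
        if "separator" not in tok:
            if "start" in tok and "end" in tok:
                # whitespace precedes the token if there is a gap between
                # the previous token's end and this token's start
                tok["separator"] = prev_end is None or tok["start"] > prev_end
            else:
                tok["separator"] = True
        prev_end = tok.get("end", prev_end)
    return tokens

sentence = [
    {"form": "Ala", "start": 0, "end": 3},
    {"form": "ma", "start": 4, "end": 6},
    {"form": "kota", "start": 7, "end": 11},
    {"form": ".", "start": 11, "end": 12},
]
fill_separators(sentence)
```

With these offsets the final "." gets separator False (it starts exactly where "kota" ends), while the other tokens get True.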

Verbose JSON

Verbose JSON uses dictionaries.

{
  "documents": [
    {
      "text": "Lubię placki. Ala ma kota.\nRaz dwa trzy.",
      "sentences": [
        {
          "tokens": [
            {
              "form": "Lubię",
              "separator": true,
              "start": 0,
              "end": 0
            },
            {
              "form": "placki",
              "separator": true,
              "start": 0,
              "end": 0
            },
            {
              "form": ".",
              "separator": false,
              "start": 0,
              "end": 0
            }
          ]
        },
        {
          "tokens": [
            {
              "form": "Ala",
              "separator": true,
              "start": 0,
              "end": 0
            },
            {
              "form": "ma",
              "separator": true,
              "start": 0,
              "end": 0
            },
            {
              "form": "kota",
              "separator": true,
              "start": 0,
              "end": 0
            },
            {
              "form": ".",
              "separator": false,
              "start": 0,
              "end": 0
            }
          ]
        }
      ]
    },
    {
      "text": "",
      "sentences": [
        {
          "tokens": [
            {
              "form": "Raz",
              "separator": true,
              "start": 0,
              "end": 0
            },
            {
              "form": "dwa",
              "separator": true,
              "start": 0,
              "end": 0
            },
            {
              "form": "trzy",
              "separator": true,
              "start": 0,
              "end": 0
            },
            {
              "form": ".",
              "separator": false,
              "start": 0,
              "end": 0
            }
          ]
        }
      ]
    }
  ]
}

Compact JSON

Compact JSON uses lists and positional fields (for speed and memory efficiency).

[
  [
    [["Lubię",true],["placki",true],[".",false]],
    [["Ala",true],["ma",true],["kota",true],[".",false]]
  ],
  [
    [["Raz",true],["dwa",true],["trzy",true],[".",false]]
  ]
]
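The two JSON shapes carry the same information about forms and separators, so converting between them is mechanical. A sketch of a compact-to-verbose converter (the document-level "text" field and the start/end offsets are absent from the compact form, so they are omitted here):

```python
def compact_to_verbose(compact):
    """Convert the compact list-of-lists form into the verbose dict form.
    compact: list of documents; each document is a list of sentences;
    each sentence is a list of [form, separator] pairs."""
    return {
        "documents": [
            {
                "sentences": [
                    {"tokens": [{"form": form, "separator": sep}
                                for form, sep in sentence]}
                    for sentence in document
                ]
            }
            for document in compact
        ]
    }

compact = [
    [
        [["Lubię", True], ["placki", True], [".", False]],
        [["Ala", True], ["ma", True], ["kota", True], [".", False]],
    ],
    [
        [["Raz", True], ["dwa", True], ["trzy", True], [".", False]],
    ],
]
verbose = compact_to_verbose(compact)
```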

Output formats

  • JSON
  • JSONL - each document on a separate line
  • TSV (CONLL)
  • XCES
  • plain - sentences divided by one empty line, documents by two empty lines

HTTP options

The default output format is plain. It can be changed with the request parameter output_format, e.g.:

$ curl -X POST "localhost:9003/?output_format=conll" -d "Ala ma kota."
Ala	Ala	1	subst:sg:nom:f	0	3
ma	mieć	1	fin:sg:ter:imperf	4	6
kota	kot	1	subst:sg:acc:m2	7	11
.	.	0	interp	11	12

  • remove_aglt (default 0) - indicates whether aglt tags should be removed
  • remove_blank (default 1) - indicates whether blank tags should be removed

Scripts

  • analyze_corpus_tagset_date.py - analyzes the tagset version of a corpus