
ppke-nlpg / Purepos

Licence: lgpl-3.0
PurePos is an open source hybrid morphological tagger.

Programming Languages

java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to Purepos

Rfdmovies Client
🎬instant recommending or finding or downloading movies via the command line
Stars: ✭ 18 (+50%)
Mutual labels:  parser
Dsongo
Encoding, decoding, marshaling, unmarshaling, and verification of the DSON (Doge Serialized Object Notation)
Stars: ✭ 23 (+91.67%)
Mutual labels:  parser
Hx Mathparser
Evaluates math expressions. Written in Haxe.
Stars: ✭ 7 (-41.67%)
Mutual labels:  parser
Chthollylang
A simple implementation of Yet another script language Chtholly
Stars: ✭ 19 (+58.33%)
Mutual labels:  parser
Jkt
Simple helper to parse JSON based on independent schema
Stars: ✭ 22 (+83.33%)
Mutual labels:  parser
Badx12
A Python Library for parsing ANSI ASC X12 files.
Stars: ✭ 25 (+108.33%)
Mutual labels:  parser
Pegviz
PEG trace visualizer
Stars: ✭ 18 (+50%)
Mutual labels:  parser
Itunes smartplaylist
iTunes Smart playlist parser with Python. Convert to Kodi xsp smart playlists.
Stars: ✭ 10 (-16.67%)
Mutual labels:  parser
Radon
A scripting language.
Stars: ✭ 22 (+83.33%)
Mutual labels:  parser
Parse Code Context
Parse code context in a single line of javascript, for functions, variable declarations, methods, prototype properties, prototype methods etc.
Stars: ✭ 7 (-41.67%)
Mutual labels:  parser
Mico
Mico ("Monkey" in catalan). Monkey language implementation done with C++. https://interpreterbook.com/
Stars: ✭ 19 (+58.33%)
Mutual labels:  parser
Graphql Mst
Convert GraphQL to mobx-state-tree models
Stars: ✭ 22 (+83.33%)
Mutual labels:  parser
Librini
Rini is a tiny, non-libc dependant, .ini file parser programmed from scratch in C99.
Stars: ✭ 25 (+108.33%)
Mutual labels:  parser
Ccalc
Scientific calculator in which you can define new constants and functions
Stars: ✭ 19 (+58.33%)
Mutual labels:  parser
Html React Parser
📝 HTML to React parser.
Stars: ✭ 846 (+6950%)
Mutual labels:  parser
Arg
Simple argument parsing
Stars: ✭ 897 (+7375%)
Mutual labels:  parser
Lisp Esque Language
💠The Lel programming language
Stars: ✭ 24 (+100%)
Mutual labels:  parser
Slang
A small, flexible and extensible front-end for GLSL.
Stars: ✭ 10 (-16.67%)
Mutual labels:  parser
Vector
A reliable, high-performance tool for building observability data pipelines.
Stars: ✭ 8,736 (+72700%)
Mutual labels:  parser
Metric Parser
📜 AST-based advanced mathematical parser written by Typescript.
Stars: ✭ 26 (+116.67%)
Mutual labels:  parser

PurePos

PurePos is an open-source HMM-based automatic morphological annotation tool. It performs tagging and lemmatization at the same time, is very fast to train, and allows easy integration of symbolic rule-based components into the annotation process, which can be used to boost the accuracy of the tool. The hybrid approach implemented in PurePos is especially beneficial for rich morphology, highly detailed annotation schemes, and situations where only a small amount of training data is available.

It is distributed under the permissive LGPL license.

Compiling

Dependencies: Maven 2

  1. Clone the repository
  2. $ mvn package -DskipTests
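
After a successful build the runnable jar should appear in the standard Maven output directory, typically target/purepos-<version>.jar (this is default Maven behaviour rather than anything PurePos-specific).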

Usage

Training the tagger requires a corpus in the following format:

  • sentences are separated by newlines, and the tagged words within a sentence are separated by spaces,
  • each token must be annotated with its lemma and POS tag, separated by hash marks: word#lemma#tag (see the example below).
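
For example, a tagged sentence with hypothetical English tokens and tags (used here only to illustrate the format) occupies a single line:

    The#the#DT dog#dog#NN barks#bark#VBZ .#.#.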

$ java -jar purepos-<version>.jar train -m out.model [-i tagged_input.txt]

The raw text file to be tagged must contain:

  • sentences in new lines,
  • and words separated by spaces.

$ java -jar purepos-<version>.jar tag -m out.model [-i input.txt] [-o output.txt]
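
For instance, given a raw input line such as (hypothetical text)

    This is a sample sentence .

the tagger writes one annotated sentence per line in the token#lemma#tag format, along the lines of (made-up tags):

    This#this#DT is#be#VBZ a#a#DT sample#sample#NN sentence#sentence#NN .#.#.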

For further help:

$ java -jar purepos-<version>.jar -h

Usage: java -jar <purepos.jar> [options...] arguments...
 tag|train|dump                         : Mode selection: train for training the
                                          tagger, tag for tagging a text with the
                                           given model, and dump for printing the model.
 -a  (--analyzer) <analyzer>            : Set the morphological analyzer. <analyzer>
                                          can be 'none', 'integrated' or a file :
                                          <morphologicalTableFile>. The default is to
                                          use the integrated one. Tagging only option.
 -b  (--beam-theta) <theta>             : Set the beam-search limit. The default is
                                          1000. Tagging only option.
 -c  (--encoding) <encoding>            : Encoding used to read the training set, or
                                          write the results. The default is your OS
                                          default.
 -d  (--beam-decoder)                   : Use Beam Search decoder. The default is to
                                          employ the Viterbi algorithm. Tagging only
                                          option.
 -dot                                   : The model objects will be printed in dot
                                           format. Dump only option.
 -e  (--emission-order) <number>        : Order of emission. First order means that
                                          the given word depends only on its tag. The
                                          default is 2.  Training only option.
 -f  (--config-file) <file>             : Configuration file containing tag mappings.
                                           The default is not to map any tag.
 -g  (--max-guessed) <number>           : Limit the max guessed tags for each token.
                                          The default is 10. Tagging only option.
 -h  (--help)                           : Print this message.
 -i  (--input-file) <file>              : File containing the training set (for
                                           training) or the text to be tagged (for
                                          tagging). The default is the standard input.
 -if (--input-format) <format>          : Set the format of the input file: 'vert' for vertical,
                                          'ord' for ordinary. The default is ordinary.
 -l  (--lemma-transformation) <class>   : Chooses between the LemmaTransformation classes:
                                          "suffix" or "generalized". The default is the suffix.
 -lt (--lemma-threshold) <length>       : Sets the threshold of the GeneralizedLemmaTransformation's 
                                          decode function. Only an option, when the transformation 
                                          type is "generalized".
 -m  (--model) <modelfile>              : Specifies a path to a model file. If an
                                          existing model is given for training, the
                                          tool performs incremental training.
 -n  (--max-results) <number>           : Set the expected maximum number of tag
                                           sequences (with their scores). The default is
                                          1. Tagging only option.
 -o  (--output-file) <file>             : File where the tagging output is redirected.
                                          Tagging only option.
 -of (--output-format) <format>         : Set the format of the output file: 'vert' for vertical,
                                          'ord' for ordinary. The default is ordinary.
                                          Tagging only option.
 -ps (--print-separately)               : If enabled, the model objects will be printed
                                           into dedicated files, named after their object
                                           names. The default is false. Dump only option.
 -r  (--rare-frequency) <threshold>     : Add only words to the suffix trie with
                                          frequency less than the given threshold. The
                                          default is 10.  Training only option.
 -s  (--suffix-length) <length>         : Use a suffix trie for guessing unknown
                                           words' tags with the given maximum suffix
                                          length. The default is 10.  Training only
                                          option.
 -t  (--tag-order) <number>             : Order of tag transition. Second order means
                                          trigram tagging. The default is 2. Training
                                          only option.
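
For example (illustrative file names, using only the flags documented above), an explicit training call and a beam-search tagging call could look like:

$ java -jar purepos-<version>.jar train -m out.model -i corpus.txt -t 2 -e 2 -s 10 -r 10

$ java -jar purepos-<version>.jar tag -m out.model -i input.txt -o output.txt -d -n 3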

Java

For details please check the Demo.java file: https://github.com/ppke-nlpg/purepos/blob/master/src/main/java/hu/ppke/itk/nlpg/purepos/Demo.java
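
If you only need to drive PurePos from your own Java code without depending on its internal API, a minimal sketch along the following lines also works. It is not taken from Demo.java (which shows the native API); the jar name and file paths are placeholders, and it simply launches the command-line interface shown above:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    public class PureposCliDemo {
        public static void main(String[] args) throws Exception {
            // Placeholder jar name and file paths; adjust to your build output.
            ProcessBuilder pb = new ProcessBuilder(
                    "java", "-jar", "purepos-2.1.jar",
                    "tag", "-m", "out.model",
                    "-i", "input.txt", "-o", "output.txt");
            pb.redirectErrorStream(true); // merge stderr into stdout

            Process process = pb.start();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // echo PurePos console output
                }
            }
            System.out.println("PurePos exited with code " + process.waitFor());
        }
    }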

Python

A wrapper package for Python is available at https://github.com/ppke-nlpg/purepos.py

Configuration file

One can provide a configuration file with -f.

First of all, the user can describe mappings of morphosyntactic tags. A mapping is composed of two parts: a regular expression pattern and a replacement string. For details check our paper (Orosz et al. 2013).

An example configuration file which maps Latin HuMor tags (with |lat) to standard ones:

    <?xml version="1.0" encoding="UTF-8" ?>
    <config>
    <tag_mapping pattern="^(.*)(\|lat)(.*)$" to="$1$3" />
    </config>
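
With this mapping, a (hypothetical) HuMor-style tag such as FN|lat|PL is rewritten to FN|PL before it is used by the tagger, while tags containing no |lat part are left unchanged.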

Mapping of lemmata is also possible with config files: one can, for example, use a "+" character to mark the boundaries of compound words. With the configuration below:

  1. the "+" is omitted from the lemmata during training (e.g. word+form#word+form#NN is learnt as word+form#wordform#NN)

  2. if the preanalyzed input has any lemma separated with the marker character, the corresponding probabilities are calculated through the mapping, while the output remains as in the preanalyzed input (e.g. word+form{{word+form[NN]}} is processed by calculating lemma probabilities for the lemma wordform, while the output of the tagger will be word+form#wordform#[NN])

    <?xml version="1.0" encoding="UTF-8" ?>
    <config>
    <lemma_mapping pattern="[+]" to="" />
    </config>

PurePos is also able to mark analyses that are unknown to the analyzer used. For this one can use a marker character. An example configuration file for utilizing the * character as a marker is:

    <?xml version="1.0" encoding="UTF-8" ?>
    <config>
    <guessed_marker>*</guessed_marker>
    </config>

One can tune the lemmatization model by assigning a fixed weight w to the suffix model (in this case the weight of the unigram model will be 1-w). An example for this:

    <?xml version="1.0" encoding="UTF-8" ?>
    <config>
    <suffix_model_weight>1.0</suffix_model_weight>
    </config>
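
Presumably the elements above can also be combined in one configuration file; a sketch merging the previous examples under a single <config> root (not taken verbatim from the PurePos documentation) would be:

    <?xml version="1.0" encoding="UTF-8" ?>
    <config>
    <tag_mapping pattern="^(.*)(\|lat)(.*)$" to="$1$3" />
    <lemma_mapping pattern="[+]" to="" />
    <guessed_marker>*</guessed_marker>
    <suffix_model_weight>1.0</suffix_model_weight>
    </config>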

Preanalyzed input

The disambiguator tool can be pipelined with a morphological analyzer or guesser. For this, the analyzer must provide analyses in the following format (with or without the scores):

    token{{lemma1[tag1]$$score1||lemma2[tag2]$$score2}}
    token{{lemma1[tag1]||lemma2[tag2]}}

Scores given in the input are incorporated as lexical log-probabilities. Please note that the brackets will be part of the tag. Therefore, inputting houses{{house[NNS]}} will result in houses#house#[NNS].
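
For instance, a preanalyzed token carrying two weighted analyses (made-up scores, interpreted as log-probabilities) could be written as

    houses{{house[NNS]$$-1.5||house[VBZ]$$-4.2}}

after which the tagger emits the selected analysis in the usual form, e.g. houses#house#[NNS].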

References

This tool is also integrated into the e-magyar language processing system, where it is called emTag.

If you use the tool, please cite the following papers:
