amir-zeldes / HebPipe

Licence: other
An NLP pipeline for Hebrew

Programming Languages

Lex, Python, HTML

Projects that are alternatives of or similar to HebPipe

yap
Yet Another (natural language) Parser
Stars: ✭ 40 (+166.67%)
Mutual labels:  hebrew, universal-dependencies, morphological-analysis
GrammarEngine
Grammatical Dictionary of the Russian Language (plus English, Japanese, etc.)
Stars: ✭ 68 (+353.33%)
Mutual labels:  part-of-speech-tagger, morphological-analysis, lemmatization
Qutuf
Qutuf (قُطُوْف): An Arabic Morphological analyzer and Part-Of-Speech tagger as an Expert System.
Stars: ✭ 84 (+460%)
Mutual labels:  part-of-speech-tagger, morphological-analysis
Jumanpp
Juman++ (a Morphological Analyzer Toolkit)
Stars: ✭ 254 (+1593.33%)
Mutual labels:  part-of-speech-tagger, morphological-analysis
Awesome Persian Nlp Ir
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Stars: ✭ 460 (+2966.67%)
Mutual labels:  part-of-speech-tagger, morphological-analysis
zeyrek
Python morphological analyzer for Turkish language. Partial port of ZemberekNLP.
Stars: ✭ 36 (+140%)
Mutual labels:  morphological-analysis, lemmatization
udar
UDAR Does Accented Russian: A finite-state morphological analyzer of Russian that handles stressed wordforms.
Stars: ✭ 15 (+0%)
Mutual labels:  morphological-analysis, lemmatization
libmorph
libmorph rus/ukr - fast & accurate morphological analyzer/analysis for Russian and Ukrainian
Stars: ✭ 16 (+6.67%)
Mutual labels:  morphological-analysis, lemmatization
simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Stars: ✭ 32 (+113.33%)
Mutual labels:  morphological-analysis, lemmatization
OpenHebrewBible
Open Hebrew Bible Project; aligning BHS with WLC; bridging ETCBC, OpenScriptures & Berean data on Hebrew Bible
Stars: ✭ 43 (+186.67%)
Mutual labels:  hebrew
UniqueBible
A cross-platform bible application, integrated with high-quality resources and amazing features, running offline in Windows, macOS and Linux
Stars: ✭ 61 (+306.67%)
Mutual labels:  hebrew
SoMeWeTa
A part-of-speech tagger with support for domain adaptation and external resources.
Stars: ✭ 20 (+33.33%)
Mutual labels:  part-of-speech-tagger
alix
A Lucene Indexer for XML, with lexical analysis (lemmatization for French)
Stars: ✭ 15 (+0%)
Mutual labels:  lemmatization
textstem
Tools for fast text stemming & lemmatization
Stars: ✭ 36 (+140%)
Mutual labels:  lemmatization
Text tone analyzer
A system that analyzes the sentiment of texts and utterances.
Stars: ✭ 15 (+0%)
Mutual labels:  lemmatization
KosherCocoa
My Objective-C port of KosherJava. KosherCocoa enables you to perform sunrise-based and sunset-based calculations for Jewish prayer and calendar.
Stars: ✭ 49 (+226.67%)
Mutual labels:  hebrew
datalinguist
Stanford CoreNLP in idiomatic Clojure.
Stars: ✭ 93 (+520%)
Mutual labels:  part-of-speech-tagger
Turkish-Lemmatizer
Lemmatization for Turkish Language
Stars: ✭ 72 (+380%)
Mutual labels:  lemmatization
rakutenma-python
Rakuten MA (Python version)
Stars: ✭ 15 (+0%)
Mutual labels:  part-of-speech-tagger
pyrrha
A language-independent post-correction app for POS tagging and lemmatization
Stars: ✭ 14 (-6.67%)
Mutual labels:  lemmatization

HebPipe Hebrew NLP Pipeline

A simple NLP pipeline for Hebrew text in UTF-8 encoding, using standard components. Basic features:

  • Performs end-to-end processing, optionally skipping steps as needed:
    • whitespace tokenization
    • morphological segmentation (excl. insertion of unexpressed articles)
    • POS tagging
    • morphological tagging
    • dependency parsing
    • named and non-named entity type recognition (experimental)
    • coreference resolution (experimental)
  • Does not alter the input string (text is reconstructible from, and alignable to, the output)
  • Compatible with Python 2.7/3.5+, Linux, Windows and OSX

Note that entity recognition and coreference are still in beta and offer rudimentary accuracy.

Online demo available at https://corpling.uis.georgetown.edu/xrenner/ (choose 'Hebrew' and enter plain text).

To cite this work please refer to the paper about the morphological segmenter here:

Zeldes, Amir (2018) A Characterwise Windowed Approach to Hebrew Morphological Segmentation. In: Proceedings of the 15th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Brussels, Belgium.

@InProceedings{Zeldes2018,
  author    = {Amir Zeldes},
  title     = {A Characterwise Windowed Approach to {H}ebrew Morphological Segmentation},
  booktitle = {Proceedings of the 15th {SIGMORPHON} Workshop on Computational Research in Phonetics, Phonology, and Morphology},
  year      = {2018},
  pages     = {101--110},
  address   = {Brussels, Belgium}
}

Installation

Either install from PyPI using pip:

pip install hebpipe

And run as a module:

python -m hebpipe example_in.txt
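
The same invocation can be scripted, e.g. to batch-process files from Python. A minimal sketch using subprocess; the flags passed through `opts` are the CLI options documented below, and the file path is illustrative:

```python
import subprocess
import sys

def hebpipe_cmd(path, opts=()):
    """Build the argv for a `python -m hebpipe` call."""
    return [sys.executable, "-m", "hebpipe", *opts, path]

def run_hebpipe(path, opts=()):
    """Run HebPipe on one file, raising with stderr on failure."""
    result = subprocess.run(hebpipe_cmd(path, opts),
                            capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(result.stderr)
    return result.stdout

# e.g. run_hebpipe("example_in.txt")
```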

Or install manually:

  • Clone this repository into the directory that the script should run in (git clone https://github.com/amir-zeldes/HebPipe)
  • In that directory, install the dependencies under Requirements, e.g. by running python setup.py install or pip install -r requirements.txt

Installation should pull in all Python dependencies, but you will need 64-bit Java installed and available on your PATH (see details below). Models can be downloaded automatically by the script on its first run.

Requirements

Python libraries

The NLP pipeline will run on Python 2.7+ or Python 3.5+ (2.6 and lower are not supported). Required libraries:

  • requests
  • numpy
  • scipy
  • pandas
  • depedit
  • xmltodict
  • xgboost==0.81
  • rftokenizer
  • joblib

You should be able to install these manually via pip if necessary (i.e. pip install rftokenizer etc.).

Note that some versions of Python + Windows do not install numpy correctly from pip, in which case you can download compiled binaries for your version of Python + Windows here: https://www.lfd.uci.edu/~gohlke/pythonlibs/, then run for example:

pip install c:\some_directory\numpy-1.15.0+mkl-cp27-cp27m-win_amd64.whl

External dependencies

The pipeline also requires Java to be available (for parsing, tagging and morphological disambiguation). For high performance and the ability to process long sentences, HebPipe invokes Java with 2 GB of RAM, meaning you will need a 64-bit version of Java (alternatively, replace Xmx2g in heb_pipe.py with a lower value, though longer sentences may then crash). You will also need binaries of Marmot and MaltParser 1.9.1 if you want to use POS tagging, morphology and parsing. These are not included in the distribution, but the script will offer to download them if they are missing.
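
Since the pipeline needs a 64-bit JVM on the PATH, it can help to verify that up front. A heuristic sketch: the `java -version` banner varies by vendor, and the "64-Bit" substring check is a common convention rather than a guarantee:

```python
import shutil
import subprocess

def java_is_64bit():
    """Heuristically check that a 64-bit Java is on the PATH.

    Note: `java -version` prints its banner to stderr; most
    64-bit JVMs include the substring "64-Bit" there."""
    if shutil.which("java") is None:
        return False
    out = subprocess.run(["java", "-version"],
                         capture_output=True, text=True)
    return "64-Bit" in out.stderr
```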

Model files

Model files are too large to include in the standard GitHub repository. The software will offer to download them automatically. The latest models can also be downloaded manually at https://corpling.uis.georgetown.edu/amir/download/heb_models/.
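
If you prefer to fetch models from a script, a standard-library streaming download against the models directory above might look like this. The `filename` argument is a placeholder: the actual model file names are listed in that directory, not here:

```python
import shutil
import urllib.request

MODELS_URL = "https://corpling.uis.georgetown.edu/amir/download/heb_models/"

def download_model(filename, dest=None):
    """Stream one model file to disk.

    `filename` is a placeholder -- see the models directory
    at MODELS_URL for the actual file names."""
    dest = dest or filename
    with urllib.request.urlopen(MODELS_URL + filename) as resp, \
            open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest
```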

Command line usage

usage: python heb_pipe.py [OPTIONS] files

positional arguments:
  files                 File name or pattern of files to process (e.g. *.txt)

optional arguments:
  -h, --help            show this help message and exit

standard module options:
  -w, --whitespace      Perform white-space based tokenization of large word
                        forms
  -t, --tokenize        Tokenize large word forms into smaller morphological
                        segments
  -p, --pos             Do POS tagging
  -l, --lemma           Do lemmatization
  -m, --morph           Do morphological tagging
  -d, --dependencies    Parse with dependency parser
  -e, --entities        Add entity spans and types
  -c, --coref           Add coreference annotations
  -s SENT, --sent SENT  XML tag to split sentences, e.g. sent for <sent ..> or
                        none for no splitting (otherwise automatic sentence
                        splitting)
  -o {pipes,conllu,sgml}, --out {pipes,conllu,sgml}
                        Output CoNLL format, SGML or just tokenize with pipes

less common options:
  -q, --quiet           Suppress verbose messages
  -x EXTENSION, --extension EXTENSION
                        Extension for output files (default: .conllu)
  --dirout DIROUT       Optional output directory (default: this dir)
  --version             Print version number and quit

Example usage

Whitespace tokenize, tokenize morphemes, add POS tags, lemmas, morphology and a dependency parse with automatic sentence splitting, plus entity recognition and coreference, for one text file, output in the default CoNLL-U format:

python heb_pipe.py -wtplmdec example_in.txt

Or specify no processing options (all steps are assumed):

python heb_pipe.py example_in.txt

Just tokenize a file using pipes:

python heb_pipe.py -wt -o pipes example_in.txt
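
If you post-process the pipes output, a small helper can split it back into super-tokens and their segments. This sketch assumes the format marks segment boundaries within each whitespace token with |, which is a plausible reading of the format name rather than a documented specification:

```python
def split_piped(line):
    """Split one line of piped tokenizer output into
    (super-token, [segments]) pairs.

    Assumes each whitespace token carries '|' between its
    morphological segments, e.g. 'X|Y Z' -> two tokens."""
    return [(tok.replace("|", ""), tok.split("|"))
            for tok in line.split()]
```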

POS tag, lemmatize, add morphology and parse a pre-tokenized file, splitting sentences by existing tags:

python heb_pipe.py -plmd -s sent example_in.txt

Add full analyses to a whole directory of *.txt files, output to a specified directory:

python heb_pipe.py -wtplmdec --dirout /home/heb/out/ *.txt

Parse a tagged TT SGML file into CoNLL tabular format for treebanking, using an existing tag to recognize sentence borders:

python heb_pipe.py -d -s sent example_in.tt

Input formats

The pipeline accepts the following kinds of input:

  • Plain text, with normal Hebrew whitespace behavior. Newlines are assumed to indicate a sentence break, but longer paragraphs will receive automatic sentence splitting too.
  • Gold super-tokenized: if whitespace tokenization is already done, you can leave out -w. In this case the system expects one super-token per line (i.e. each whitespace-delimited word on its own line)
  • Gold tokenized: if gold morphological segmentation is already done, you can input one gold token per line.
  • XML sentence tags in input: use -s TAGNAME to indicate an XML tag providing gold sentence boundaries.
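
For the gold super-tokenized case, converting already whitespace-tokenized text into the one-super-token-per-line format is trivial; a sketch:

```python
def to_one_token_per_line(text):
    """Convert whitespace-tokenized text to the one-super-token-
    per-line format expected when -w is skipped."""
    return "\n".join(text.split())
```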