All Projects → YontiLevin → Hebrew-Tokenizer

YontiLevin / Hebrew-Tokenizer

Licence: MIT License
A very simple python tokenizer for Hebrew text.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Hebrew-Tokenizer

neural tokenizer
Tokenize English sentences using neural networks.
Stars: ✭ 64 (+300%)
Mutual labels:  tokenizer
ilmulti
Tooling to play around with multilingual machine translation for Indian Languages.
Stars: ✭ 19 (+18.75%)
Mutual labels:  tokenizer
UniqueBible
A cross-platform bible application, integrated with high-quality resources and amazing features, running offline in Windows, macOS and Linux
Stars: ✭ 61 (+281.25%)
Mutual labels:  hebrew
jargon
Tokenizers and lemmatizers for Go
Stars: ✭ 98 (+512.5%)
Mutual labels:  tokenizer
liblex
C library for Lexical Analysis
Stars: ✭ 25 (+56.25%)
Mutual labels:  tokenizer
mystem-scala
Morphological analyzer `mystem` (Russian language) wrapper for JVM languages
Stars: ✭ 21 (+31.25%)
Mutual labels:  tokenizer
rustfst
Rust re-implementation of OpenFST - library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). A Python binding is also available.
Stars: ✭ 104 (+550%)
Mutual labels:  tokenizer
KosherCocoa
My Objective-C port of KosherJava. KosherCocoa enables you to perform sunrise-based and sunset-based calculations for Jewish prayer and calendar.
Stars: ✭ 49 (+206.25%)
Mutual labels:  hebrew
vscode-blockman
VSCode extension to highlight nested code blocks
Stars: ✭ 233 (+1356.25%)
Mutual labels:  tokenizer
bredon
A modern CSS value compiler in JavaScript
Stars: ✭ 39 (+143.75%)
Mutual labels:  tokenizer
berserker
Berserker - BERt chineSE woRd toKenizER
Stars: ✭ 17 (+6.25%)
Mutual labels:  tokenizer
OpenHebrewBible
Open Hebrew Bible Project; aligning BHS with WLC; bridging ETCBC, OpenScriptures & Berean data on Hebrew Bible
Stars: ✭ 43 (+168.75%)
Mutual labels:  hebrew
DartBible-Flutter
cross-platform mobile bible app [Android & iOS / iPhone / iPad]; written in Dart programming language
Stars: ✭ 26 (+62.5%)
Mutual labels:  hebrew
elasticsearch-plugins
Some native scoring script plugins for elasticsearch
Stars: ✭ 30 (+87.5%)
Mutual labels:  tokenizer
text2text
Text2Text: Cross-lingual natural language processing and generation toolkit
Stars: ✭ 188 (+1075%)
Mutual labels:  tokenizer
farasapy
A Python implementation of Farasa toolkit
Stars: ✭ 69 (+331.25%)
Mutual labels:  tokenizer
tokenizer
Tokenize CSS according to the CSS Syntax
Stars: ✭ 52 (+225%)
Mutual labels:  tokenizer
cang-jie
Chinese tokenizer for tantivy, based on jieba-rs
Stars: ✭ 48 (+200%)
Mutual labels:  tokenizer
PaddleTokenizer
使用 PaddlePaddle 实现基于深度神经网络的中文分词引擎 | A DNN Chinese Tokenizer by Using PaddlePaddle
Stars: ✭ 14 (-12.5%)
Mutual labels:  tokenizer
simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Stars: ✭ 32 (+100%)
Mutual labels:  tokenizer

PyPI download total PyPI version fury.io MIT license

Hebrew Tokenizer

A very simple python tokenizer for Hebrew text.
No batteries included - No dependencies needed!

UPDATES

13/11/21

Added support for alphanumeric non english characters.

23/10/21

Hebrew accents/punctuation(ניקוד) support added.

28/04/21

accented letters support was added - great work Daniel 👊

23/10/2020

Added support for Python 3.8+.
The Code look like shit now BUT it's working😄.
In the near future I will wrap it up nicely.

Installation

  1. via pip
    1. pip install hebrew_tokenizer
  2. from source
    1. Download / Clone
    2. python setup.py install

Usage

import hebrew_tokenizer as ht
hebrew_text = "אתמול, 8.6.2018, בשעה 17:00 הלכתי עם אמא למכולת"
tokens = ht.tokenize(hebrew_text)  # tokenize returns a generator!
for grp, token, token_num, (start_index, end_index) in tokens:
    print('{}, {}'.format(grp, token))

>>> HEBREW, 'אתמול'
>>> PUNCTUATION, ',' 
>>> DATE, '8.6.2018'
>>> PUNCTUATION, ',' 
>>> HEBREW, 'בשעה'
>>> HOUR, '17:00'
>>> HEBREW, 'הלכתי'
>>> HEBREW, 'עם'
>>> HEBREW, 'אמא'
>>> HEBREW, 'למכולת'

# by default it doesn't return whitespaces but it can be done easily
tokens = ht.tokenize(hebrew_text, with_whitespaces=True)  # notice the with_whitespace flag
for grp, token, token_num, (start_index, end_index) in tokens:
    print('{}, {}'.format(grp, token))
  
>>> HEBREW, 'אתמול'
>>> WHITESPACE, ''
>>> PUNCTUATION, ',' 
>>> WHITESPACE, ''
>>> DATE, '8.6.2018'
>>> PUNCTUATION, ','
>>> WHITESPACE, ''
>>> HEBREW, 'בשעה'
>>> WHITESPACE, ''
>>> HOUR, '17:00'
>>> WHITESPACE, ''
>>> HEBREW, 'הלכתי'
>>> WHITESPACE, ''
>>> HEBREW, 'עם'
>>> WHITESPACE, ''
>>> HEBREW, 'אמא'
>>> WHITESPACE, ''
>>> HEBREW, 'למכולת'

Disclaimer

This is NOT a POS tagger.
If that is what you are looking for - check out yap.

Contribute

Found a special case where the tokenizer fails?

  1. Try to fix it on your own by improving the regex patterns the tokenizer is based on.
  2. Make sure that your improvement doesn't break the other scenarios in test.py
  3. Add you special case to test.py
  4. Commit a pull request.

NLPH

For other great Hebrew resources check out NLPH

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].