Rust re-implementation of OpenFST - library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). A Python binding is also available.

Stars: ✭ 104 (+550%)

Mutual labels: tokenizer

KosherCocoa

My Objective-C port of KosherJava. KosherCocoa enables you to perform sunrise-based and sunset-based calculations for Jewish prayer and calendar.

Stars: ✭ 49 (+206.25%)

Mutual labels: hebrew

vscode-blockman

VSCode extension to highlight nested code blocks

Stars: ✭ 233 (+1356.25%)

Mutual labels: tokenizer

bredon

A modern CSS value compiler in JavaScript

Stars: ✭ 39 (+143.75%)

Mutual labels: tokenizer

berserker

Berserker - BERt chineSE woRd toKenizER

Stars: ✭ 17 (+6.25%)

Mutual labels: tokenizer

OpenHebrewBible

Open Hebrew Bible Project; aligning BHS with WLC; bridging ETCBC, OpenScriptures & Berean data on Hebrew Bible

Stars: ✭ 43 (+168.75%)

Mutual labels: hebrew

DartBible-Flutter

cross-platform mobile bible app [Android & iOS / iPhone / iPad]; written in Dart programming language

Stars: ✭ 26 (+62.5%)

Mutual labels: hebrew

elasticsearch-plugins

Some native scoring script plugins for elasticsearch

Stars: ✭ 30 (+87.5%)

Mutual labels: tokenizer

text2text

Text2Text: Cross-lingual natural language processing and generation toolkit

Stars: ✭ 188 (+1075%)

Mutual labels: tokenizer

farasapy

A Python implementation of Farasa toolkit

Stars: ✭ 69 (+331.25%)

Mutual labels: tokenizer

tokenizer

Tokenize CSS according to the CSS Syntax

Stars: ✭ 52 (+225%)

Mutual labels: tokenizer

cang-jie

Chinese tokenizer for tantivy, based on jieba-rs

Stars: ✭ 48 (+200%)

Mutual labels: tokenizer

PaddleTokenizer

A Chinese tokenizer (word segmentation engine) based on a deep neural network, implemented with PaddlePaddle

Stars: ✭ 14 (-12.5%)

Mutual labels: tokenizer

simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

Stars: ✭ 32 (+100%)

Mutual labels: tokenizer


Hebrew Tokenizer

A very simple Python tokenizer for Hebrew text.
No batteries included - No dependencies needed!

UPDATES

13/11/21

Added support for alphanumeric non-English characters.

23/10/21

Added support for Hebrew diacritics (niqqud, ניקוד).

28/04/21

Added support for accented letters. Great work, Daniel 👊

23/10/2020

Added support for Python 3.8+.
The code looks like shit now, BUT it's working 😄.
I will wrap it up nicely in the near future.

Installation

Via pip:
1. pip install hebrew_tokenizer
From source:
1. Download or clone the repository
2. python setup.py install

Usage

import hebrew_tokenizer as ht
hebrew_text = "אתמול, 8.6.2018, בשעה 17:00 הלכתי עם אמא למכולת"
tokens = ht.tokenize(hebrew_text)  # tokenize returns a generator!
for grp, token, token_num, (start_index, end_index) in tokens:
    print('{}, {}'.format(grp, token))

>>> HEBREW, 'אתמול'
>>> PUNCTUATION, ',' 
>>> DATE, '8.6.2018'
>>> PUNCTUATION, ',' 
>>> HEBREW, 'בשעה'
>>> HOUR, '17:00'
>>> HEBREW, 'הלכתי'
>>> HEBREW, 'עם'
>>> HEBREW, 'אמא'
>>> HEBREW, 'למכולת'

# by default whitespace tokens are not returned; pass with_whitespaces=True to include them
tokens = ht.tokenize(hebrew_text, with_whitespaces=True)  # notice the with_whitespaces flag
for grp, token, token_num, (start_index, end_index) in tokens:
    print('{}, {}'.format(grp, token))
  
>>> HEBREW, 'אתמול'
>>> PUNCTUATION, ','
>>> WHITESPACE, ''
>>> DATE, '8.6.2018'
>>> PUNCTUATION, ','
>>> WHITESPACE, ''
>>> HEBREW, 'בשעה'
>>> WHITESPACE, ''
>>> HOUR, '17:00'
>>> WHITESPACE, ''
>>> HEBREW, 'הלכתי'
>>> WHITESPACE, ''
>>> HEBREW, 'עם'
>>> WHITESPACE, ''
>>> HEBREW, 'אמא'
>>> WHITESPACE, ''
>>> HEBREW, 'למכולת'
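
Since tokenize yields tuples, it is often handy to collect the tokens by group. The helper below is a minimal sketch, not part of the library: tokens_by_group is a hypothetical name, and it assumes each item has the shape used in the loops above (grp, token, token_num, (start_index, end_index)) and that the group labels are hashable values such as strings. The sample list is hand-written for illustration and is not captured library output.

```python
from collections import defaultdict


def tokens_by_group(tokens):
    """Collect token texts under their group label (hypothetical helper)."""
    grouped = defaultdict(list)
    # Assumed item shape: (grp, token, token_num, (start_index, end_index))
    for grp, token, token_num, (start_index, end_index) in tokens:
        grouped[grp].append(token)
    return dict(grouped)


# Hand-written sample in the assumed tuple shape, for illustration only.
sample = [
    ("HEBREW", "אתמול", 0, (0, 5)),
    ("PUNCTUATION", ",", 1, (5, 6)),
    ("DATE", "8.6.2018", 2, (7, 15)),
]
grouped = tokens_by_group(sample)
```

With the package installed, the same helper should accept the generator directly, for example tokens_by_group(ht.tokenize(hebrew_text)).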

Disclaimer

This is NOT a POS tagger.
If that is what you are looking for - check out yap.

Contribute

Found a special case where the tokenizer fails?

1. Try to fix it yourself by improving the regex patterns the tokenizer is based on.
2. Make sure that your improvement doesn't break the other scenarios in test.py.
3. Add your special case to test.py.
4. Submit a pull request.
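
For anyone new to regex-based tokenizers, here is a rough sketch of the general technique using only the standard re module. The patterns, the NUMBER group, and the name sketch_tokenize are assumptions made up for this illustration; the library's real patterns live in its source and are more thorough. Pattern order matters: more specific patterns (dates, hours) come before more general ones, which is exactly where special cases tend to break.

```python
import re

# Hypothetical patterns for illustration only; NOT the library's actual patterns.
TOKEN_PATTERNS = [
    ("DATE", r"\d{1,2}\.\d{1,2}\.\d{2,4}"),
    ("HOUR", r"\d{1,2}:\d{2}"),
    ("HEBREW", r"[\u05d0-\u05ea]+"),
    ("NUMBER", r"\d+"),
    ("PUNCTUATION", r"[^\w\s]"),
    ("WHITESPACE", r"\s+"),
]

# One alternation of named groups; earlier patterns win at each position.
MASTER = re.compile(
    "|".join("(?P<{}>{})".format(name, pattern) for name, pattern in TOKEN_PATTERNS)
)


def sketch_tokenize(text, with_whitespaces=False):
    """Yield (grp, token, token_num, (start, end)) tuples, mimicking the README's shape."""
    token_num = 0
    for match in MASTER.finditer(text):
        grp = match.lastgroup  # name of the alternative that matched
        if grp == "WHITESPACE" and not with_whitespaces:
            continue
        yield grp, match.group(), token_num, (match.start(), match.end())
        token_num += 1
```

Characters that match none of the patterns (Latin letters, for instance) are silently skipped in this sketch. A new special case would normally mean adding or reordering a pattern and then adding a matching assertion to the tests.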

NLPH

For other great Hebrew resources check out NLPH


YontiLevin / Hebrew-Tokenizer
