aatimofeev / spacy_russian_tokenizer

Licence: other
Custom Russian tokenizer for spaCy

spacy_russian_tokenizer: Russian segmentation and tokenization rules for spaCy

Tokenization in Russian is not straightforward when it comes to compound words joined by hyphens. Some of them (e.g. "какой-то", "кое-что", "бизнес-ланч") should be treated as a single unit, while others (e.g. "суп-харчо", "инженер-программист") should be split into multiple tokens. Correct tokenization is especially important when training a language model: most training datasets (e.g. SynTagRus) already split or merge such tokens correctly, so a tokenizer that disagrees with them reduces the model's quality. Example of the default behaviour:

from spacy.lang.ru import Russian
text = "Не ветер, а какой-то ураган!"
nlp = Russian()
doc = nlp(text)
print([token.text for token in doc])
# ['Не', 'ветер', ',', 'а', 'какой', '-', 'то', 'ураган', '!']
# Notice that the word "какой-то" is split into three tokens.

This package uses the spaCy Matcher API to define rules for specific cases and exceptions in Russian.

Installation

pip install git+https://github.com/aatimofeev/spacy_russian_tokenizer.git

Implementation

The package is essentially a collection of manually tuned Matcher patterns. Most patterns were derived from the SynTagRus vocabulary and the lemma dictionary of the Russian National Corpus (НКРЯ).
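The merge-by-rule idea can be illustrated without spaCy. The following is a minimal sketch, not the package's implementation: `MERGE_WORDS`, `naive_tokenize` and `merge_hyphenated` are hypothetical names introduced here, the word list contains only the examples from the introduction, and only two-part compounds are handled.

```python
import re

# Hypothetical stand-in for the package's merge rules: hyphenated words
# that should stay a single token (examples from the introduction).
MERGE_WORDS = {"какой-то", "кое-что", "бизнес-ланч"}


def naive_tokenize(text):
    """Split text into word and punctuation tokens, breaking on hyphens."""
    return re.findall(r"\w+|[^\w\s]", text)


def merge_hyphenated(tokens, merge_words=MERGE_WORDS):
    """Re-join 'left', '-', 'right' triples that are listed in merge_words."""
    merged = []
    i = 0
    while i < len(tokens):
        candidate = "".join(tokens[i:i + 3])
        is_triple = i + 2 < len(tokens) and tokens[i + 1] == "-"
        if is_triple and candidate.lower() in merge_words:
            merged.append(candidate)  # keep the compound as one token
            i += 3
        else:
            merged.append(tokens[i])  # leave everything else untouched
            i += 1
    return merged


tokens = naive_tokenize("Не ветер, а какой-то ураган!")
print(merge_hyphenated(tokens))
```

Compounds that are not in the list, such as "суп-харчо", stay split into three tokens, which mirrors the distinction described in the introduction.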

Usage

The core patterns are collected in the MERGE_PATTERNS variable.

from spacy.lang.ru import Russian
from spacy_russian_tokenizer import RussianTokenizer, MERGE_PATTERNS
text = "Не ветер, а какой-то ураган!"
nlp = Russian()
# Build the merging component from the core patterns.
russian_tokenizer = RussianTokenizer(nlp, MERGE_PATTERNS)
# Passing a component object to add_pipe is the spaCy v2 calling convention.
nlp.add_pipe(russian_tokenizer, name='russian_tokenizer')
doc = nlp(text)
print([token.text for token in doc])
# ['Не', 'ветер', ',', 'а', 'какой-то', 'ураган', '!']
# Notice that the word "какой-то" remains a single token.

You can also add patterns that are found in SynTagRus but absent from the Russian National Corpus:

from spacy.lang.ru import Russian
from spacy_russian_tokenizer import RussianTokenizer, MERGE_PATTERNS, SYNTAGRUS_RARE_CASES
text = "«Фобос-Грунт» — российская автоматическая межпланетная станция (АМС)."
nlp = Russian()
# Combine the core patterns with the rare SynTagRus cases.
russian_tokenizer = RussianTokenizer(nlp, MERGE_PATTERNS + SYNTAGRUS_RARE_CASES)
nlp.add_pipe(russian_tokenizer, name='russian_tokenizer')
doc = nlp(text)
print([token.text for token in doc])
# ['«', 'Фобос-Грунт', '»', '—', 'российская', 'автоматическая', 'межпланетная', 'станция', '(', 'АМС', ')', '.']