Alternatives and detailed information of spacy_russian_tokenizer

aatimofeev / spacy_russian_tokenizer

Licence: other

Custom Russian tokenizer for spaCy

Programming Languages

python


Projects that are alternatives to or similar to spacy_russian_tokenizer

nerus

Large silver standard Russian corpus with NER, morphology and syntax markup

Stars: ✭ 47 (+34.29%)

Mutual labels: russian

spacy-server

🦜 Containerized HTTP API for industrial-strength NLP via spaCy and sense2vec

Stars: ✭ 58 (+65.71%)

Mutual labels: tokenization

JavaneseBackend

Javanese.online website back-end

Stars: ✭ 31 (-11.43%)

Mutual labels: russian

ukrainian-typographic-layouts

Typographic keyboard layouts for the Ukrainian and Russian languages

Stars: ✭ 69 (+97.14%)

Mutual labels: russian

SynapseOS

SynapseOS is a modular operating system written in C.

Stars: ✭ 93 (+165.71%)

Mutual labels: russian

react-challenge-sort-and-search

The first React Challenge issue: sorting and searching data

Stars: ✭ 22 (-37.14%)

Mutual labels: russian

js-stack-from-scratch

🌺 Russian translation of "JavaScript Stack from Scratch" from the React-Theming developers https://github.com/sm-react/react-theming

Stars: ✭ 394 (+1025.71%)

Mutual labels: russian

bem-flashcards

Simple single-page flashcards application based on the bem-core/bem-history and BEM methodology

Stars: ✭ 19 (-45.71%)

Mutual labels: russian

libmorph

libmorph rus/ukr - fast & accurate morphological analyzer/analyses for Russian and Ukrainian

Stars: ✭ 16 (-54.29%)

Mutual labels: russian

wink-tokenizer

Multilingual tokenizer that automatically tags each token with its type

Stars: ✭ 51 (+45.71%)

Mutual labels: tokenization

FCH-TTS

A fast Text-to-Speech (TTS) model. Works well for English, Mandarin/Chinese, Japanese, Korean, Russian and Tibetan (tested so far).

Stars: ✭ 154 (+340%)

Mutual labels: russian

ling

Natural Language Processing Toolkit in Golang

Stars: ✭ 57 (+62.86%)

Mutual labels: tokenization

podcast

🦆 Episodes of the Goose&Duck podcast

Stars: ✭ 19 (-45.71%)

Mutual labels: russian

nlp-cheat-sheet-python

NLP Cheat Sheet, Python, spacy, LexNPL, NLTK, tokenization, stemming, sentence detection, named entity recognition

Stars: ✭ 69 (+97.14%)

Mutual labels: tokenization

r-geo-course

An introductory course on using R for geographic data visualisation and analysis (in Russian).

Stars: ✭ 18 (-48.57%)

Mutual labels: russian

DolboNet

Russian-language chatbot for Discord based on the Transformer architecture

Stars: ✭ 53 (+51.43%)

Mutual labels: russian

android-interview

A collection of interview questions for the Android developer position, in Russian.

Stars: ✭ 74 (+111.43%)

Mutual labels: russian

👨‍🔬 In Russian: A regularly updated, structured collection of free resources on Data Science topics: courses, books, open data, blogs, and ready-made solutions.

Stars: ✭ 102 (+191.43%)

Mutual labels: russian

navec

Compact high quality word embeddings for Russian language

Stars: ✭ 118 (+237.14%)

Mutual labels: russian

rutimeparser

Recognize date and time in Russian text and return datetime.datetime.

Stars: ✭ 17 (-51.43%)

Mutual labels: russian


spacy_russian_tokenizer: Russian segmentation and tokenization rules for spaCy

Tokenization in Russian is not straightforward when it comes to compound words connected by hyphens. Some of them (e.g. "какой-то", "кое-что", "бизнес-ланч") should be treated as a single unit, while others (e.g. "суп-харчо", "инженер-программист") should be treated as multiple tokens. Correct tokenization is especially important when training a language model, because most training datasets (e.g. SynTagRus) split or merge such tokens correctly, so wrong tokenization reduces the model's quality. Example of the default behaviour:

from spacy.lang.ru import Russian
text = "Не ветер, а какой-то ураган!"
nlp = Russian()
doc = nlp(text)
print([token.text for token in doc])
# ['Не', 'ветер', ',', 'а', 'какой', '-', 'то', 'ураган', '!']
# Notice that the word "какой-то" is split into three tokens.

This package uses the spaCy Matcher API to create rules for specific cases and exceptions in Russian.

Installation

pip install git+https://github.com/aatimofeev/spacy_russian_tokenizer.git

Implementation

Basically, the package is just a collection of manually tuned Matcher patterns. Most patterns were derived from the SynTagRus vocabulary and the lemma dictionary of the Russian National Corpus (НКРЯ).
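
To make the idea of a merge rule concrete, here is a minimal pure-Python sketch that does not use spaCy at all. It takes the token list produced by the default tokenizer and joins any run of tokens that matches a rule. The names EXAMPLE_RULES and merge_tokens are hypothetical and exist only for this illustration; they are not part of the package, and the package's real patterns and internals may differ. In spaCy terms this is roughly what a Matcher pattern followed by a retokenizer merge would achieve.

```python
from typing import List, Sequence, Tuple

# Hypothetical merge rules for illustration only (NOT the package's MERGE_PATTERNS).
# Each rule is a sequence of lowercase token texts that should become one token.
EXAMPLE_RULES: List[Tuple[str, ...]] = [
    ("какой", "-", "то"),
    ("кое", "-", "что"),
    ("бизнес", "-", "ланч"),
]


def merge_tokens(tokens: Sequence[str], rules: Sequence[Tuple[str, ...]]) -> List[str]:
    """Join every run of tokens that matches a rule (case-insensitive)."""
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        for rule in rules:
            window = tuple(t.lower() for t in tokens[i:i + len(rule)])
            # Skip empty rules so the position always advances.
            if rule and window == rule:
                merged.append("".join(tokens[i:i + len(rule)]))
                i += len(rule)
                break
        else:
            # No rule matched at this position: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Token list as produced by the default tokenizer in the example above.
default_tokens = ["Не", "ветер", ",", "а", "какой", "-", "то", "ураган", "!"]
print(merge_tokens(default_tokens, EXAMPLE_RULES))
# ['Не', 'ветер', ',', 'а', 'какой-то', 'ураган', '!']

# A compound with no matching rule stays split into separate tokens.
print(merge_tokens(["суп", "-", "харчо"], EXAMPLE_RULES))
# ['суп', '-', 'харчо']
```

An explicit list of rules, rather than a blanket "merge everything around a hyphen" rule, is what lets "какой-то" become one token while "суп-харчо" stays split.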

Usage

The core patterns are collected in the MERGE_PATTERNS variable.

from spacy.lang.ru import Russian
from spacy_russian_tokenizer import RussianTokenizer, MERGE_PATTERNS

text = "Не ветер, а какой-то ураган!"
nlp = Russian()
# Build the merging component from the core patterns.
russian_tokenizer = RussianTokenizer(nlp, MERGE_PATTERNS)
# Passing a component object to add_pipe is the spaCy 2.x calling convention.
nlp.add_pipe(russian_tokenizer, name='russian_tokenizer')
doc = nlp(text)
print([token.text for token in doc])
# ['Не', 'ветер', ',', 'а', 'какой-то', 'ураган', '!']
# Notice that the word "какой-то" remains a single token.

One can also add patterns that are found in SynTagRus but absent from the Russian National Corpus:

from spacy.lang.ru import Russian
from spacy_russian_tokenizer import RussianTokenizer, MERGE_PATTERNS, SYNTAGRUS_RARE_CASES

text = "«Фобос-Грунт» — российская автоматическая межпланетная станция (АМС)."
nlp = Russian()
# Combine the core patterns with the rare cases taken from SynTagRus.
russian_tokenizer = RussianTokenizer(nlp, MERGE_PATTERNS + SYNTAGRUS_RARE_CASES)
# Passing a component object to add_pipe is the spaCy 2.x calling convention.
nlp.add_pipe(russian_tokenizer, name='russian_tokenizer')
doc = nlp(text)
print([token.text for token in doc])
# ['«', 'Фобос-Грунт', '»', '—', 'российская', 'автоматическая', 'межпланетная', 'станция', '(', 'АМС', ')', '.']
