Mottl / ru_punkt

Licence: MIT license
Russian language support for NLTK's PunktSentenceTokenizer

Projects that are alternatives to or similar to ru_punkt

10000sentences
10,000 sentences: an Android app to help you learn new words in foreign languages
Stars: ✭ 116 (+136.73%)
Mutual labels:  sentence
youtube-video-maker
📹 A tool for automatic video creation and uploading on YouTube
Stars: ✭ 134 (+173.47%)
Mutual labels:  nltk
Reuters-21578-Classification
Text classification with Reuters-21578 datasets using Gensim Word2Vec and Keras LSTM
Stars: ✭ 44 (-10.2%)
Mutual labels:  nltk
probabilistic nlg
Tensorflow Implementation of Stochastic Wasserstein Autoencoder for Probabilistic Sentence Generation (NAACL 2019).
Stars: ✭ 28 (-42.86%)
Mutual labels:  sentence
nodejs-support
The JavaScript (Node.js) version of KoalaNLP, a collection of Korean morphological analyzers and syntactic parsers.
Stars: ✭ 81 (+65.31%)
Mutual labels:  sentence
nlp-akash
Natural Language Processing notes and implementations.
Stars: ✭ 66 (+34.69%)
Mutual labels:  nltk
nltk-api-server
API server for NLTK
Stars: ✭ 23 (-53.06%)
Mutual labels:  nltk
mystem-scala
Morphological analyzer `mystem` (Russian language) wrapper for JVM languages
Stars: ✭ 21 (-57.14%)
Mutual labels:  russian-specific
character-extraction
Extracts character names from a text file and performs analysis of text sentences containing the names.
Stars: ✭ 40 (-18.37%)
Mutual labels:  nltk
nltk-maxent-pos-tagger
maximum entropy based part-of-speech tagger for NLTK
Stars: ✭ 45 (-8.16%)
Mutual labels:  nltk
ipython-notebook-nltk
An introduction to Natural Language processing using NLTK with python.
Stars: ✭ 19 (-61.22%)
Mutual labels:  nltk
PragmaticSegmenterNet
Port of PragmaticSegmenter for sentence boundary detection
Stars: ✭ 25 (-48.98%)
Mutual labels:  sentence
Stock-Analyser
📈 Stocks technical analysis code collection and Stocks data platform.
Stars: ✭ 30 (-38.78%)
Mutual labels:  nltk
NRCLex
An affect generator based on TextBlob and the NRC affect lexicon. Note that lexicon license is for research purposes only.
Stars: ✭ 42 (-14.29%)
Mutual labels:  nltk
reddit-opinion-mining
Sentiment analysis and opinion mining of Reddit data.
Stars: ✭ 15 (-69.39%)
Mutual labels:  nltk
nlp workshop odsc europe20
Extensive tutorials for the Advanced NLP Workshop in Open Data Science Conference Europe 2020. We will leverage machine learning, deep learning and deep transfer learning to learn and solve popular tasks using NLP including NER, Classification, Recommendation \ Information Retrieval, Summarization, Classification, Language Translation, Q&A and T…
Stars: ✭ 127 (+159.18%)
Mutual labels:  nltk
FA
Repository of practical coursework for the Applied Informatics program of the ITiABD faculty at the Financial University under the Government of the Russian Federation
Stars: ✭ 26 (-46.94%)
Mutual labels:  russian-specific
Introduction-to-text-mining-with-Python
Lectures in Urban Data Science Lab, Seoul
Stars: ✭ 25 (-48.98%)
Mutual labels:  nltk
swfk
“Snake Wrangling for Kids”: the Russian translation of the book.
Stars: ✭ 24 (-51.02%)
Mutual labels:  russian-specific
match-casing
Match the case of `value` to that of `base`
Stars: ✭ 13 (-73.47%)
Mutual labels:  sentence

ru_punkt

Russian language support for NLTK's PunktSentenceTokenizer

Supported Python versions: 2.7, 3.x

ru_punkt has been part of nltk_data since 2019-07-04.

Installation

  1. Install the NLTK Python package:
       pip install nltk
  2. Download the punkt data:
       import nltk
       nltk.download('punkt')

Usage

import nltk

text = "Ай да А.С. Пушкин! Ай да сукин сын!"
print("Before:", nltk.sent_tokenize(text))
print("After:", nltk.sent_tokenize(text, language="russian"))

Output:

Before: ['Ай да А.С.', 'Пушкин!', 'Ай да сукин сын!']
After: ['Ай да А.С. Пушкин!', 'Ай да сукин сын!']

Training data

The data for sentence tokenization was taken from three sources:
– Articles from Russian Wikipedia (about 1 million sentences);
– Common Russian abbreviations from the Russian Orthographic Dictionary, edited by V. V. Lopatin;
– Generated name initials.
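
The README does not say how the name initials were generated. The following is only a minimal sketch, assuming they are all one- and two-letter combinations of Russian capital letters. The function name `generate_initials` and the choice of letters are hypothetical. The entries are written in the lowercase form without a trailing period that Punkt uses for abbreviation types.

```python
from itertools import product

# Assumption: letters that may start a Russian first name or patronymic.
# Ъ, Ы and Ь are left out because they rarely or never begin a word.
LETTERS = "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЭЮЯ"


def generate_initials(letters=LETTERS):
    """Return abbreviation types for single and double initials.

    Types are lowercased and lack the final period, e.g. "А.С." -> "а.с".
    """
    singles = {a.lower() for a in letters}
    doubles = {(a + "." + b).lower() for a, b in product(letters, repeat=2)}
    return singles | doubles


initials = generate_initials()
print(len(initials))
print("а.с" in initials)
```

A set like this could be merged into the abbreviation list before building the tokenizer, so that tokens such as "А.С." are not treated as sentence ends.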

Implementation notes

Experiments showed that a tokenizer using params.abbrev_types alone performs better than one that also uses params.collocations and params.ortho_context, so the latter two were removed from the trained tokenizer.