Mottl / ru_punkt

Licence: MIT

Russian language support for NLTK's PunktSentenceTokenizer

ru_punkt

ru_punkt has been part of nltk_data since 2019-07-04.

Installation

Install the NLTK Python package:

pip install nltk

Download the punkt data:

import nltk
nltk.download('punkt')

Usage

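import nltk

text = "Ай да А.С. Пушкин! Ай да сукин сын!"
# Default (English) model: the initials "А.С." are mistaken for a sentence end.
print("Before:", nltk.sent_tokenize(text))
# Russian model: the initials stay attached to the surname.
print("After:", nltk.sent_tokenize(text, language="russian"))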

Output:

Before: ['Ай да А.С.', 'Пушкин!', 'Ай да сукин сын!']
After: ['Ай да А.С. Пушкин!', 'Ай да сукин сын!']

Training data

The training data for the sentence tokenizer comes from three sources:
– Articles from Russian Wikipedia (about 1 million sentences);
– Common Russian abbreviations from the Russian Orthographic Dictionary, edited by V. V. Lopatin;
– Generated name initials.
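
The third source, generated initials, can be pictured with a small sketch. How the initials were actually generated is not described here, so everything below is an assumption: the helper name `initial_abbrev_types`, the letter set, and the choice of two-letter combinations are all hypothetical. The only Punkt-specific detail is that abbreviation types are stored lowercased and without the final period, so "А.С." becomes "а.с".

```python
from itertools import product

# Assumption: lowercase Russian letters that can begin a given name or
# patronymic (ъ, ы, ь and ё are left out for this illustration).
LETTERS = "абвгдежзийклмнопрстуфхцчшщэюя"


def initial_abbrev_types(letters=LETTERS):
    """Build Punkt-style abbreviation types for paired initials like 'а.с'."""
    return {f"{first}.{second}" for first, second in product(letters, repeat=2)}


abbrev_types = initial_abbrev_types()
print(len(abbrev_types))      # number of generated pairs
print("а.с" in abbrev_types)  # True: covers the "А.С." from the usage example
```

A set like this could be merged into the abbreviation types of a trained model, so that initials are not treated as sentence ends.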

Implementation notes

Experiments showed that the tokenizer performs better with params.abbrev_types alone than with params.abbrev_types combined with params.collocations and params.ortho_context, so the latter two were removed from the trained tokenizer.
