
maxoodf / Russian_news_corpus

Licence: apache-2.0

Russian mass media stemmed texts corpus: a corpus of lemmatized (morphologically normalized) texts from Russian mass media

Labels

machine-learning nlp ml text word2vec nlp-machine-learning russian corpus articles

Projects that are alternatives to, or similar to, Russian_news_corpus

Typographie

Web service for preparation of Russian texts for the web publication

Stars: ✭ 12 (-84.21%)

Mutual labels: russian, articles, text

sentiment-analysis-of-tweets-in-russian

Sentiment analysis of tweets in Russian using Convolutional Neural Networks (CNN) with Word2Vec embeddings.

Stars: ✭ 51 (-32.89%)

Mutual labels: word2vec, nlp-machine-learning

lda2vec

Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec from this paper https://arxiv.org/abs/1605.02019

Stars: ✭ 27 (-64.47%)

Mutual labels: text, word2vec

sent2vec

How to encode sentences in a high-dimensional vector space, a.k.a., sentence embedding.

Stars: ✭ 99 (+30.26%)

Mutual labels: word2vec, nlp-machine-learning

NLP-Natural-Language-Processing

Projects and useful articles / links

Stars: ✭ 149 (+96.05%)

Mutual labels: articles, nlp-machine-learning

word2vec-tsne

Google News and Leo Tolstoy: Visualizing Word2Vec Word Embeddings using t-SNE.

Stars: ✭ 59 (-22.37%)

Mutual labels: word2vec, nlp-machine-learning

MLSummerSchool

Materials for an elective course on machine learning and artificial intelligence

Stars: ✭ 27 (-64.47%)

Mutual labels: ml, russian

Jbook

Notes about programming, advice, algorithms and a lot of good stuff with Java

Stars: ✭ 233 (+206.58%)

Mutual labels: russian, articles

Lmdb Embeddings

Fast word vectors with little memory usage in Python

Stars: ✭ 404 (+431.58%)

Mutual labels: word2vec, text

Nlp chinese corpus

Large Scale Chinese Corpus for NLP

Stars: ✭ 6,656 (+8657.89%)

Mutual labels: corpus, word2vec

Click2analyze Androiddevchallenge

An app that analyzes text and fixes anomalies in messages that deviate from what is standard, normal, or expected. #AndroidDevChallenge

Stars: ✭ 20 (-73.68%)

Mutual labels: ml, nlp-machine-learning

navec

Compact high quality word embeddings for Russian language

Stars: ✭ 118 (+55.26%)

Mutual labels: word2vec, russian

text-classification-cn

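Chinese text classification in practice, based on the Sogou news corpus, using traditional machine learning methods as well as pre-trained models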

Stars: ✭ 81 (+6.58%)

Mutual labels: word2vec, corpus

Machine-Learning-Projects-2

No description or website provided.

Stars: ✭ 23 (-69.74%)

Mutual labels: ml, nlp-machine-learning

SENet-for-Weakly-Supervised-Relation-Extraction

No description or website provided.

Stars: ✭ 39 (-48.68%)

Mutual labels: ml, nlp-machine-learning

NTUA-slp-nlp

💻Speech and Natural Language Processing (SLP & NLP) Lab Assignments for ECE NTUA

Stars: ✭ 19 (-75%)

Mutual labels: word2vec, nlp-machine-learning

Eyo

🦔 CLI for restoring the letter «ё» (yo) in Russian texts

Stars: ✭ 119 (+56.58%)

Mutual labels: russian, text

Pzad

Course "Applied Problems in Data Analysis" (CMC faculty, Lomonosov Moscow State University)

Stars: ✭ 160 (+110.53%)

Mutual labels: russian, ml

wordfish-python

extract relationships from standardized terms from corpus of interest with deep learning 🐟

Stars: ✭ 19 (-75%)

Mutual labels: word2vec, corpus

Mldm

Lecture course "Machine Learning and Data Mining" at the CMC faculty of Lomonosov Moscow State University

Stars: ✭ 35 (-53.95%)

Mutual labels: russian, ml


Russian mass media stemmed texts corpus

A collection of articles from Russian mass media (the top 27 online sources) covering the period 04.2016 - 03.2017. The articles are stemmed (lemmatized, i.e. morphologically normalized) and stored one per line, separated by the '\n' character.
The original collection (without stemming) can also be downloaded here.
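
Since each article occupies one line, the decompressed file can be streamed without loading all ~4.5 GB into memory. The following is a minimal sketch, not part of the original project: the function name `iter_articles` is hypothetical, and both the UTF-8 encoding and whitespace tokenization are assumptions that the README does not confirm.

```python
from typing import Iterator, List


def iter_articles(path: str, encoding: str = "utf-8") -> Iterator[List[str]]:
    """Yield one article at a time as a list of tokens.

    Assumptions (not stated in the README): the file is UTF-8 encoded
    and the stemmed words within an article are separated by whitespace.
    """
    with open(path, encoding=encoding) as corpus:
        for line in corpus:  # one article per '\n'-terminated line
            tokens = line.split()
            if tokens:  # skip blank lines
                yield tokens


# Hypothetical usage once russian_news.txt has been produced:
# for tokens in iter_articles("russian_news.txt"):
#     ...
```

Yielding token lists keeps memory use bounded by the longest article, and the output shape is what most word-embedding trainers expect as input.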

Size: ~ 4.5 GB
Articles: ~ 1 500 000
Words, total: ~ 360 000 000
Words, unique: ~ 5 178 821
Vocabulary size: 435 114 (word frequency > 10)
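
The vocabulary figure above applies a frequency threshold (word frequency > 10). One way such a count could be reproduced is sketched below. How the author actually computed the figure is not documented, so the whitespace tokenization and the helper name `build_vocabulary` are assumptions, and the result may differ from 435 114.

```python
from collections import Counter
from typing import Dict, Iterable


def build_vocabulary(articles: Iterable[str], min_count: int = 10) -> Dict[str, int]:
    """Count tokens and keep those that occur MORE than min_count times.

    The strict '>' mirrors the "word frequency > 10" wording above.
    Whitespace tokenization is an assumption.
    """
    counts: Counter = Counter()
    for article in articles:
        counts.update(article.split())
    return {word: n for word, n in counts.items() if n > min_count}


# Hypothetical usage on the decompressed corpus (a file object iterates by line):
# with open("russian_news.txt", encoding="utf-8") as f:
#     vocab = build_vocabulary(f)
#     print(len(vocab))
```

The counter holds every unique word before filtering, so expect it to store roughly five million keys on the full corpus.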

The corpus can be useful for NLP projects, for training word2vec models, and for developing other ML algorithms.

HOWTO

The file is compressed with the bzip2 utility and split into 49M parts.
Run the following commands to get the corpus in txt format:

git clone https://github.com/maxoodf/russian_news_corpus.git
cd ./russian_news_corpus
cat ./russian_news.txt.bz2_a* | bzip2 -d > ./russian_news.txt
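
Where `cat` and `bzip2` are not available (for example on Windows), the same join-and-decompress step can be approximated in Python. This is a sketch under the assumption that the parts form a single bzip2 stream that was split after compression, as the commands above imply. The function name `join_and_decompress` is hypothetical.

```python
import bz2
import glob


def join_and_decompress(parts_glob: str, out_path: str, chunk_size: int = 1 << 20) -> int:
    """Feed the split parts, in name order, through one bzip2 decompressor.

    Assumes the parts are slices of a single bzip2 stream.
    Returns the number of decompressed bytes written to out_path.
    """
    decompressor = bz2.BZ2Decompressor()
    written = 0
    with open(out_path, "wb") as out:
        # Sorting reproduces the order in which the shell expands the glob.
        for part in sorted(glob.glob(parts_glob)):
            with open(part, "rb") as src:
                for chunk in iter(lambda: src.read(chunk_size), b""):
                    data = decompressor.decompress(chunk)
                    out.write(data)
                    written += len(data)
    return written


# Hypothetical usage from inside the cloned repository:
# join_and_decompress("./russian_news.txt.bz2_a*", "./russian_news.txt")
```

Reading in fixed-size chunks keeps memory use low regardless of the size of the parts.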



Stars: ✭ 76
