
natasha / Corus

Licence: mit

Links to Russian corpora + Python functions for loading and parsing




Links to publicly available Russian corpora + code for loading and parsing. 20+ datasets, 350Gb+ of text.

Usage

For example, let's use the dump of lenta.ru by @yutkin. Manually download the archive (the link is in the Reference section):

wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz

Use corus to load the data:

>>> from corus import load_lenta

>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)

LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)

Iterate over texts:

>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...

For links to other datasets and their loaders see the Reference section.
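Since the loader is consumed with `next` and a `for` loop, records can be processed lazily, without holding the whole corpus in memory. The sketch below counts topics over the first few records. `count_topics` and `DemoRecord` are hypothetical names used only for illustration; `DemoRecord` is a stand-in so the snippet runs without the archive, and it assumes nothing about real records beyond the `topic` field shown above.

```python
from collections import Counter, namedtuple
from itertools import islice

# Stand-in for loader output; only the `topic` field is used here.
DemoRecord = namedtuple('DemoRecord', ['title', 'topic'])


def count_topics(records, limit=None):
    """Count `topic` values over at most `limit` records from any iterable."""
    # islice stops after `limit` items, so a large dump is never fully read.
    return Counter(record.topic for record in islice(records, limit))


demo = iter([
    DemoRecord('a', 'Россия'),
    DemoRecord('b', 'Мир'),
    DemoRecord('c', 'Россия'),
])
print(count_topics(demo).most_common())
```

With the real data, something like `count_topics(load_lenta(path), limit=10000)` should give a quick look at the topic distribution.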

Documentation

Materials are in Russian:

Install

corus supports Python 3.5+, PyPy 3.

$ pip install corus

Reference

Dataset	API `from corus import`	Tags	Texts	Uncompressed	Description
Lenta.ru
Lenta.ru v1.0	`load_lenta` `#`	`news`	739 351	1.66 Gb	`wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz`
Lenta.ru v1.1+	`load_lenta2` `#`	`news`	800 975	1.94 Gb	`wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2`
Lib.rus.ec	`load_librusec` `#`	`fiction`	301 871	144.92 Gb	Dump of lib.rus.ec prepared for RUSSE workshop `wget http://panchenko.me/data/russe/librusec_fb2.plain.gz`
Rossiya Segodnya	`load_ria_raw` `#` `load_ria` `#`	`news`	1 003 869	3.70 Gb	`wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz`
Mokoron Russian Twitter Corpus	`load_mokoron` `#`	`social` `sentiment`	17 633 417	1.86 Gb	Russian Twitter sentiment markup Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql
Wikipedia	`load_wiki` `#`		1 541 401	12.94 Gb	Russian Wiki dump `wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2`
GramEval2020	`load_gramru` `#`		162 372	30.04 Mb	`wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip` `unzip master.zip` `mv GramEval2020-master/dataTrain train` `mv GramEval2020-master/dataOpenTest dev` `rm -r master.zip GramEval2020-master` `wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu`
OpenCorpora	`load_corpora` `#`	`morph`	4 030	20.21 Mb	`wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip`
RusVectores SimLex-965	`load_simlex` `#`	`emb` `sim`			`wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv` `wget https://rusvectores.org/static/testsets/ru_simlex965.tsv`
Omnia Russica	`load_omnia` `#`	`morph` `web` `fiction`		489.62 Gb	Taiga + Wiki + Araneum. Read "Even larger Russian corpus" https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf Manually download http://bit.ly/2ZT4BY9
factRuEval-2016	`load_factru` `#`	`ner` `news`	254	969.27 Kb	Manual PER, LOC, ORG markup prepared for 2016 Dialog competition `wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip` `unzip master.zip` `rm master.zip`
Gareev	`load_gareev` `#`	`ner` `news`	97	455.02 Kb	Manual PER, ORG markup (no LOC). Email Rinat Gareev ([email protected]) to ask for the dataset `tar -xvf rus-ner-news-corpus.iob.tar.gz` `rm rus-ner-news-corpus.iob.tar.gz`
Collection5	`load_ne5` `#`	`ner` `news`	1 000	2.96 Mb	News articles with manual PER, LOC, ORG markup `wget http://www.labinform.ru/pub/named_entities/collection5.zip` `unzip collection5.zip` `rm collection5.zip`
WiNER	`load_wikiner` `#`	`ner`	203 287	36.15 Mb	Sentences from Wiki auto annotated with PER, LOC, ORG tags `wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2`
BSNLP-2019	`load_bsnlp` `#`	`ner`	464	1.16 Mb	Markup prepared for 2019 BSNLP Shared Task `wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip` `wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip` `unzip TRAININGDATA_BSNLP_2019_shared_task.zip` `unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg` `rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip`
Persons-1000	`load_persons` `#`	`ner` `news`	1 000	2.96 Mb	Same as Collection5, only PER markup + normalized names `wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip`
The Russian Drug Reaction Corpus (RuDReC)	`load_rudrec` `#`	`ner`	4 809	1.73 Kb	RuDReC is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named entities and the effectiveness of pharmaceutical products. Here you can download and work with the annotated part, to get the raw part (1.4M reviews) please refer to https://github.com/cimm-kzn/RuDReC. `wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json`
Taiga	Large collection of Russian texts from various sources: news sites, magazines, literature, social networks `wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz` `tar -xzvf retagged_taiga.tar.gz`
Arzamas	`load_taiga_arzamas` `#`	`news`	311	4.50 Mb
Fontanka	`load_taiga_fontanka` `#`	`news`	342 683	786.23 Mb
Interfax	`load_taiga_interfax` `#`	`news`	46 429	77.55 Mb
KP	`load_taiga_kp` `#`	`news`	45 503	61.79 Mb
Lenta	`load_taiga_lenta` `#`	`news`	36 446	95.15 Mb
Taiga/N+1	`load_taiga_nplus1` `#`	`news`	7 696	24.96 Mb
Magazines	`load_taiga_magazines` `#`		39 890	2.19 Gb
Subtitles	`load_taiga_subtitles` `#`		19 011	909.08 Mb
Social	`load_taiga_social` `#`	`social`	1 876 442	648.18 Mb
Proza	`load_taiga_proza` `#`	`fiction`	1 732 434	38.25 Gb
Stihi	`load_taiga_stihi` `#`		9 157 686	12.80 Gb
Russian NLP Datasets	Several Russian news datasets from webhose.io, lenta.ru and other news sites.
News	`load_buriy_news` `#`	`news`	2 154 801	6.84 Gb	Dump of top 40 news + 20 fashion news sites. `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2` `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2` `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2`
Webhose	`load_buriy_webhose` `#`	`news`	285 965	859.32 Mb	Dump from webhose.io, 300 sources for one month. `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/webhose-2016.tar.bz2`
ODS #proj_news_viz	Several news sites scraped by members of #proj_news_viz ODS project.
Interfax	`load_ods_interfax` `#`	`news`	543 961	1.22 Gb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/interfax.csv.gz`
Gazeta	`load_ods_gazeta` `#`	`news`	865 847	1.63 Gb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/gazeta.csv.gz`
Izvestia	`load_ods_izvestia` `#`	`news`	86 601	307.19 Mb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/iz.csv.gz`
Meduza	`load_ods_meduza` `#`	`news`	71 806	270.11 Mb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/meduza.csv.gz`
RIA	`load_ods_ria` `#`	`news`	101 543	233.88 Mb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/ria.csv.gz`
Russia Today	`load_ods_rt` `#`	`news`	106 644	187.12 Mb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/rt.csv.gz`
TASS	`load_ods_tass` `#`	`news`	1 135 635	3.27 Gb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/tass-001.csv.gz`
Universal Dependencies
GSD	`load_ud_gsd` `#`	`morph` `syntax`	5 030	1.01 Mb	`wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu`
Taiga	`load_ud_taiga` `#`	`morph` `syntax`	3 264	353.80 Kb	`wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu`
PUD	`load_ud_pud` `#`	`morph` `syntax`	1 000	207.78 Kb	`wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu`
SynTagRus	`load_ud_syntag` `#`	`morph` `syntax`	61 889	11.33 Mb	`wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu`
morphoRuEval-2017
General Internet-Corpus	`load_morphoru_gicrya` `#`	`morph`	83 148	10.58 Mb	`wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip` `unzip GIKRYA_texts_new.zip` `rm GIKRYA_texts_new.zip`
Russian National Corpus	`load_morphoru_rnc` `#`	`morph`	98 892	12.71 Mb	`wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar` `unrar x RNC_texts.rar` `rm RNC_texts.rar`
OpenCorpora	`load_morphoru_corpora` `#`	`morph`	38 510	4.80 Mb	`wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar` `unrar x OpenCorpora_Texts.rar` `rm OpenCorpora_Texts.rar`
RUSSE Russian Semantic Relatedness
HJ: Human Judgements of Word Pairs	`load_russe_hj` `#`	`emb` `sim`			`wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv`
RT: Synonyms and Hypernyms from the Thesaurus RuThes	`load_russe_rt` `#`	`emb` `sim`			`wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv`
AE: Cognitive Associations from the Sociation.org Experiment	`load_russe_ae` `#`	`emb` `sim`			`wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv` `wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv` `wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv`
Toloka Datasets
Lexical Relations from the Wisdom of the Crowd (LRWC)	`load_toloka_lrwc` `#`	`emb` `sim`			`wget https://tlk.s3.yandex.net/dataset/LRWC.zip` `unzip LRWC.zip` `rm LRWC.zip`
The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT)	`load_ruadrect` `#`	`social`	9 515	2.09 Mb	This corpus was developed for the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020 `wget https://github.com/cimm-kzn/RuDReC/raw/master/data/RuADReCT.zip` `unzip RuADReCT.zip` `rm RuADReCT.zip`
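
The plain `wget` commands in the table can also be reproduced from Python, which may be convenient inside a notebook. This is a minimal sketch using only the standard library; `local_name` and `fetch` are hypothetical helpers, not part of corus. It only covers direct file links, not the entries that require a manual download or an `unzip` step.

```python
import os
import urllib.request
from urllib.parse import urlparse


def local_name(url):
    """Return the file name that `wget` would use for a direct file URL."""
    return os.path.basename(urlparse(url).path)


def fetch(url, directory='.'):
    """Download `url` into `directory` unless the file is already there."""
    path = os.path.join(directory, local_name(url))
    if not os.path.exists(path):
        # Large dumps take a while; nothing is downloaded twice.
        urllib.request.urlretrieve(url, path)
    return path


# Example (commented out because it downloads a large archive):
# path = fetch('https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz')
```

The returned path can then be passed to the matching loader from the table.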

Support

Chat — https://telegram.me/natural_language_processing
Issues — https://github.com/natasha/corus/issues
Commercial support — https://lab.alexkuk.ru

Development

Tests:

make test

Add new source:

Implement corus/sources/<source>.py
Add import into corus/sources/__init__.py
Add meta into corus/sources/meta.py
Add example into docs.ipynb (check meta table is correct)
Run tests (readme is updated)
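
As a rough illustration of the first step, a source module usually boils down to a record type plus a generator that yields records lazily. The sketch below is an assumption about the general shape, written with the standard library only; it does not use corus's own internal helpers or base classes, and `ExampleRecord`, `load_example` and the column names are hypothetical.

```python
import csv
import gzip
from collections import namedtuple

# Hypothetical record type; a real source defines fields matching its data.
ExampleRecord = namedtuple('ExampleRecord', ['url', 'title', 'text'])


def load_example(path):
    """Lazily yield one record per row of a gzipped CSV file with a header row."""
    with gzip.open(path, mode='rt', encoding='utf-8', newline='') as file:
        for row in csv.DictReader(file):
            yield ExampleRecord(row['url'], row['title'], row['text'])
```

Yielding records one at a time keeps the loader consistent with the `next(records)` and `for record in records` usage shown in the Usage section.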

Package:

make version
git push
git push --tags

make clean wheel upload
