IlyaGusev / PoetryCorpus

License: Apache-2.0
Poetry corpus of the Russian language

Poetry corpus of the Russian language

Package for poem analysis and generation: https://github.com/IlyaGusev/rupo

Statistics for the corpus of texts with metadata

  • Characters: 13208090
  • Words: 2186827
  • Poems: 16694
  • Poems tagged with themes: 3904
  • Authors: 195
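
As a rough illustration of how figures like these could be derived, the sketch below counts characters, words, poems, theme-tagged poems, and authors over a small in-memory list. The `Poem` record and its fields are hypothetical, since the corpus schema is not described here, and the counting rules (for example, what counts as a word) are assumptions rather than the rules used to produce the numbers above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Poem:
    # Hypothetical record; the real corpus schema may differ.
    author: str
    text: str
    themes: list = field(default_factory=list)


def corpus_stats(poems):
    """Compute summary counts over an iterable of Poem records."""
    poems = list(poems)
    return {
        "characters": sum(len(p.text) for p in poems),
        # Assumption: a word is a maximal run of letters, digits or underscores.
        "words": sum(len(re.findall(r"\w+", p.text)) for p in poems),
        "poems": len(poems),
        "poems_with_themes": sum(1 for p in poems if p.themes),
        "authors": len({p.author for p in poems}),
    }


sample = [
    Poem("Пушкин", "Я помню чудное мгновенье", ["любовь"]),
    Poem("Пушкин", "Мороз и солнце"),
    Poem("Лермонтов", "Белеет парус одинокой", ["море"]),
]
stats = corpus_stats(sample)
print(stats)
```

Counting characters with `len` includes spaces and punctuation; a different convention would change the totals.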

Installing dependencies manually

sudo apt-get install build-essential libssl-dev libffi-dev python-dev libxslt1-dev libxslt1.1 libxml2-dev libxml2
sudo pip3 install -r requirements.txt

Preprocessing

# Spiders that collect texts from websites
scrapy runspider poetry/apps/corpus/spiders/klassika.py -o datasets/web/klassika.xml
scrapy runspider poetry/apps/corpus/spiders/strofa.py -o datasets/web/strofa.xml
scrapy runspider poetry/apps/corpus/spiders/themes.py -o datasets/web/themes.xml
scrapy runspider poetry/apps/corpus/spiders/rupoem.py -o datasets/web/rupoem.xml
# Script that merges and deduplicates the texts and generates XML and JSON versions of the corpus
python3 poetry/apps/corpus/scripts/unite.py

or

# Fetch the prebuilt version of the corpus
git lfs pull
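
The merge-and-deduplicate step mentioned in the preprocessing commands could work roughly as follows. This is not the logic of `unite.py`, whose internals are not shown here. It is a minimal sketch that assumes poems arrive as (author, text) pairs and treats two texts as duplicates when they match after lowercasing and dropping punctuation and whitespace. The function names `normalize` and `unite` are hypothetical.

```python
import hashlib
import re


def normalize(text):
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[\W_]+", "", text.lower())


def unite(*sources):
    """Merge lists of (author, text) pairs, keeping the first copy of each text."""
    seen = set()
    merged = []
    for source in sources:
        for author, text in source:
            # Hash the normalized text so the set of seen keys stays small.
            key = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
            if key in seen:
                continue
            seen.add(key)
            merged.append((author, text))
    return merged


# Toy inputs standing in for the output of two spiders.
klassika = [("Пушкин", "Мороз и солнце; день чудесный!")]
strofa = [
    ("Пушкин", "мороз и солнце, день чудесный"),
    ("Лермонтов", "Белеет парус одинокой"),
]
merged = unite(klassika, strofa)
print(len(merged))
```

Exact matching on normalized text only catches near-identical copies; texts that differ by a word or a stanza would need a fuzzier comparison.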

To initialize the database with syllable and stress markup

sh reset_db.sh
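
To give a sense of what syllable and stress markup means for a single word, here is a small sketch. It relies only on the fact that every Russian syllable contains exactly one vowel letter, so counting vowels gives the syllable count. The helper names are hypothetical, the stressed syllable is supplied by the caller rather than predicted, and nothing here reflects how `reset_db.sh` or the database actually stores markup.

```python
VOWELS = set("аеёиоуыэюя")


def count_syllables(word):
    """Number of syllables, equal to the number of vowel letters."""
    return sum(1 for ch in word.lower() if ch in VOWELS)


def mark_stress(word, stressed_syllable):
    """Put a combining acute accent (U+0301) after the vowel of the
    given 1-based syllable; the position must be known in advance."""
    seen = 0
    out = []
    for ch in word:
        out.append(ch)
        if ch.lower() in VOWELS:
            seen += 1
            if seen == stressed_syllable:
                out.append("\u0301")
    return "".join(out)


print(count_syllables("мгновенье"))
print(mark_stress("парус", 1))
```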

Running via Docker Compose

# Install Docker and docker-compose
curl -sSL https://get.docker.com/ | sh
sudo curl -L "https://github.com/docker/compose/releases/download/1.10.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
# Make the downloaded binary executable
sudo chmod +x /usr/local/bin/docker-compose
# Run
docker-compose up

References