chakki-works / Coarij

License: MIT
Corpus of Annual Reports in Japan

CoARiJ: Corpus of Annual Reports in Japan

We organized Japanese financial reports to encourage applying NLP techniques to financial analytics.

Dataset

The corpus is split by fiscal year.

Master version:

fiscal_year  Raw file version (F)  Text extracted version (E)
2014         .zip (9.3GB)          .zip (269.9MB)
2015         .zip (9.8GB)          .zip (291.1MB)
2016         .zip (10.2GB)         .zip (334.7MB)
2017         .zip (9.1GB)          .zip (309.4MB)
2018         .zip (10.5GB)         .zip (260.9MB)

Past release

Statistics

fiscal_year  number_of_reports  has_csr_reports  has_financial_data  has_stock_data
2014         3,724              92               3,583               3,595
2015         3,870              96               3,725               3,751
2016         4,066              97               3,924               3,941
2017         3,578              89               3,441               3,472
2018         3,513              70               2,893               3,413
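
To compare years of different sizes, it can help to express each has_* count as a share of number_of_reports. The following is a minimal sketch of that arithmetic; the figures are copied from the statistics table, and the `coverage` helper is a hypothetical name, not part of the coarij package.

```python
# Figures copied from the statistics table above.
stats = {
    2014: {"number_of_reports": 3724, "has_csr_reports": 92,
           "has_financial_data": 3583, "has_stock_data": 3595},
    2015: {"number_of_reports": 3870, "has_csr_reports": 96,
           "has_financial_data": 3725, "has_stock_data": 3751},
    2016: {"number_of_reports": 4066, "has_csr_reports": 97,
           "has_financial_data": 3924, "has_stock_data": 3941},
    2017: {"number_of_reports": 3578, "has_csr_reports": 89,
           "has_financial_data": 3441, "has_stock_data": 3472},
    2018: {"number_of_reports": 3513, "has_csr_reports": 70,
           "has_financial_data": 2893, "has_stock_data": 3413},
}


def coverage(year_stats):
    """Return each has_* count as a fraction of number_of_reports."""
    total = year_stats["number_of_reports"]
    return {key: value / total
            for key, value in year_stats.items()
            if key.startswith("has_")}


for year in sorted(stats):
    shares = coverage(stats[year])
    # Format each share as a percentage with one decimal place.
    print(year, {key: f"{value:.1%}" for key, value in shares.items()})
```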

File structure

Raw file version (--kind F)

The dataset is structured as follows.

chakki_esg_financial_{year}.zip
└──{year}
     ├── documents.csv
     └── docs/

docs contains XBRL and PDF files:

  • XBRL files of annual reports (retrieved from EDINET).
  • PDF files of CSR reports (additional content).

documents.csv holds metadata such as the following. See the Wiki for details.

  • edinet_code: E0000X
  • filer_name: XXX株式会社
  • fiscal_year: 201X
  • fiscal_period: FY
  • doc_path: docs/S000000X.xbrl
  • csr_path: docs/E0000X_201X_JP_36.pdf
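
As an illustration of how this metadata might be used, here is a minimal sketch that reads documents.csv and resolves doc_path and csr_path relative to the {year} directory. It assumes a comma-separated file with a header row naming the fields listed above; the actual delimiter and full column set are described in the Wiki and may differ. `load_documents` and `resolve_paths` are hypothetical helper names, and the sample row reuses the placeholder values from the list.

```python
import csv
import tempfile
from pathlib import Path


def load_documents(year_dir):
    """Read {year}/documents.csv and return a list of row dicts.

    Assumes a comma-separated file with a header row; adjust `delimiter`
    if the actual file differs.
    """
    csv_path = Path(year_dir) / "documents.csv"
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter=","))


def resolve_paths(row, year_dir):
    """Turn the relative doc_path / csr_path into paths under year_dir."""
    base = Path(year_dir)
    doc = base / row["doc_path"]
    # csr_path may be empty when a filer has no CSR report.
    csr = base / row["csr_path"] if row.get("csr_path") else None
    return doc, csr


# Stand-in data built from the placeholder values in the list above.
with tempfile.TemporaryDirectory() as tmp:
    year_dir = Path(tmp) / "2014"
    year_dir.mkdir()
    (year_dir / "documents.csv").write_text(
        "edinet_code,filer_name,fiscal_year,fiscal_period,doc_path,csr_path\n"
        "E0000X,XXX株式会社,201X,FY,docs/S000000X.xbrl,docs/E0000X_201X_JP_36.pdf\n",
        encoding="utf-8",
    )
    rows = load_documents(year_dir)
    for row in rows:
        doc, csr = resolve_paths(row, year_dir)
        print(row["edinet_code"], row["filer_name"],
              doc.name, csr.name if csr else "-")
```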

Text extracted version (--kind E)

The text-extracted version contains txt files, one for each extracted part of an annual report.
The parts that can be extracted are defined in xbrr.

chakki_esg_financial_{year}_extracted.zip
└──{year}
     ├── documents.csv
     └── docs/

Tool

You can download the dataset with the command-line tool.

pip install coarij

Run the tool with -- to see its usage (the CLI is built with fire).

coarij --

Example commands:

# Download raw file version dataset of 2014.
coarij download --kind F --year 2014

# Extract the business.overview_of_result part for TIS Inc. (sec code = 3626).
coarij extract business.overview_of_result --sec_code 3626

# Tokenize text with Janome (`janome` and `sudachi` are supported).
pip install janome
coarij tokenize --tokenizer janome

# Show tokenized result (words are separated by \t).
head -n 5 data/processed/2014/docs/S100552V_business_overview_of_result_tokenized.txt
1       【      業績    等      の      概要    】
(       1       )               業績
当      連結    会計    年度    における        我が国  経済    は      、     消費    税率    引上げ  に      伴う    駆け込み        需要    の      反動   や      海外    景気    動向    に対する        先行き  懸念    等      から   弱い    動き    も      見      られ    まし    た      が      、      企業   収益    の      改善    等      により  全体  ...
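
Because the tokenized files are plain text with tab-separated words, they can be consumed without any special tooling. The sketch below splits each line on tabs and counts token frequencies. The sample input is abbreviated from the output shown above, and `read_tokenized` is a hypothetical helper, not part of the coarij package.

```python
import io
from collections import Counter


def read_tokenized(lines):
    """Yield one list of tokens per non-empty line, splitting on tabs."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        # Drop empty or whitespace-only tokens left by consecutive tabs.
        yield [tok for tok in line.split("\t") if tok.strip()]


# Abbreviated stand-in for a *_tokenized.txt file.
sample = io.StringIO(
    "1\t【\t業績\t等\tの\t概要\t】\n"
    "(\t1\t)\t\t業績\n"
)

counts = Counter(tok for tokens in read_tokenized(sample) for tok in tokens)
print(counts.most_common(3))
```

To process a real file, open it with `open(path, encoding="utf-8")` and pass the file object to `read_tokenized` in place of `sample`.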

To download the latest dataset, specify --version master when downloading the data.

  • For the parts that can be parsed, see xbrr.

You can use Ledger to select the files you need from the overall CoARiJ dataset.

from coarij.storage import Storage


storage = Storage("your/data/directory")
ledger = storage.get_ledger()
collected = ledger.collect(edinet_code="E00021")