chakki-works / Coarij

Licence: mit

Corpus of Annual Reports in Japan

Labels

natural-language-processing dataset finance corpus

Projects that are alternatives of or similar to Coarij

A dataset of millions of news articles scraped from a curated list of data sources.

Stars: ✭ 255 (+363.64%)

Mutual labels: dataset, corpus, natural-language-processing

Awesome Hungarian Nlp

A curated list of NLP resources for Hungarian

Stars: ✭ 121 (+120%)

Mutual labels: dataset, corpus, natural-language-processing

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

Stars: ✭ 108 (+96.36%)

Mutual labels: dataset, corpus, natural-language-processing

Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text

Stars: ✭ 139 (+152.73%)

Mutual labels: dataset, corpus, natural-language-processing

Insuranceqa Corpus Zh

🚁 Insurance industry corpus, chatbot

Stars: ✭ 821 (+1392.73%)

Mutual labels: dataset, corpus, natural-language-processing

Nlp bahasa resources

A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia

Stars: ✭ 158 (+187.27%)

Mutual labels: dataset, corpus, natural-language-processing

A curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.

Stars: ✭ 283 (+414.55%)

Mutual labels: dataset, natural-language-processing

A collection of datasets that pair questions with SQL queries.

Stars: ✭ 287 (+421.82%)

Mutual labels: dataset, natural-language-processing

Weixin public corpus

Corpus of WeChat official account articles

Stars: ✭ 465 (+745.45%)

Mutual labels: corpus, natural-language-processing

Hate Speech And Offensive Language

Repository for the paper "Automated Hate Speech Detection and the Problem of Offensive Language", ICWSM 2017

Stars: ✭ 543 (+887.27%)

Mutual labels: dataset, natural-language-processing

Chinese Names Corpus

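Chinese names corpus. Name generator. Chinese full names, surnames, given names, forms of address, Japanese names, transliterated names, English names. Useful for Chinese word segmentation and person-name entity recognition.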

Stars: ✭ 3,053 (+5450.91%)

Mutual labels: dataset, corpus

Open source annotation tool for machine learning practitioners.

Stars: ✭ 5,600 (+10081.82%)

Mutual labels: dataset, natural-language-processing

An R package for the Quantitative Analysis of Textual Data

Stars: ✭ 647 (+1076.36%)

Mutual labels: corpus, natural-language-processing

Medical-Names-Corpus

Medical corpus. Corpus of medical institution names. Standard drug codes.

Stars: ✭ 26 (-52.73%)

Mutual labels: corpus, dataset

Awesome Persian Nlp Ir

Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources

Stars: ✭ 460 (+736.36%)

Mutual labels: corpus, natural-language-processing

Species-Names-Corpus

Species names corpus. Plant names, animal names.

Stars: ✭ 23 (-58.18%)

Mutual labels: corpus, dataset

Cluepretrainedmodels

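A collection of high-quality Chinese pre-trained models: state-of-the-art large models, the fastest small models, and models specialized for similarity.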

Stars: ✭ 493 (+796.36%)

Mutual labels: dataset, corpus

Company Names Corpus

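Company names corpus. Organization names corpus. Company short names, abbreviations, brand words, enterprise names. Useful for Chinese word segmentation and organization-name entity recognition.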

Stars: ✭ 868 (+1478.18%)

Mutual labels: dataset, corpus

Typing Assistant

Typing Assistant provides the ability to autocomplete words and suggests predictions for the next word. This makes typing faster, more intelligent and reduces effort.

Stars: ✭ 32 (-41.82%)

Mutual labels: corpus, natural-language-processing

Code for the collection and analysis of the MTNT dataset

Stars: ✭ 48 (-12.73%)

Mutual labels: dataset, natural-language-processing

CoARiJ: Corpus of Annual Reports in Japan

We organized Japanese financial reports to encourage applying NLP techniques to financial analytics.

Dataset

The corpora are split by fiscal year.

Master version:

fiscal_year	Raw file version (F)	Text extracted version (E)
2014	.zip (9.3GB)	.zip (269.9MB)
2015	.zip (9.8GB)	.zip (291.1MB)
2016	.zip (10.2GB)	.zip (334.7MB)
2017	.zip (9.1GB)	.zip (309.4MB)
2018	.zip (10.5GB)	.zip (260.9MB)

Financial data is from 決算短信情報 (earnings summary reports).
- We use non-consolidated data if it exists.
Stock data is from 月間相場表（内国株式） (monthly quotation tables for domestic stocks).
- close is taken at the fiscal period end, and open is taken one year before it.
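
Because open and close bracket one fiscal year, a simple one-year return can be derived from the two values. The following is a minimal sketch, not part of the CoARiJ tool: the function name `yearly_return` and the sample values are made up for illustration, and it assumes both values are positive numbers in the same unit.

```python
def yearly_return(open_value: float, close_value: float) -> float:
    """Simple return over the fiscal year: (close - open) / open.

    Hypothetical helper; `open_value` is the value one year before the
    fiscal period end and `close_value` is the value at the period end.
    """
    if open_value <= 0:
        raise ValueError("open_value must be positive")
    return (close_value - open_value) / open_value


# Illustrative numbers only, not taken from the dataset.
print(f"{yearly_return(1000.0, 1150.0):.1%}")
```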

Past release

v1.0

Statistics

fiscal_year	number_of_reports	has_csr_reports	has_financial_data	has_stock_data
2014	3,724	92	3,583	3,595
2015	3,870	96	3,725	3,751
2016	4,066	97	3,924	3,941
2017	3,578	89	3,441	3,472
2018	3,513	70	2,893	3,413
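
The counts above can be turned into coverage ratios to see how much of each year carries CSR reports, financial data, and stock data. This is a small sketch that only re-uses the numbers from the table; the names `STATS` and `coverage` are illustrative.

```python
# fiscal_year -> (number_of_reports, has_csr_reports,
#                 has_financial_data, has_stock_data), copied from the table.
STATS = {
    2014: (3724, 92, 3583, 3595),
    2015: (3870, 96, 3725, 3751),
    2016: (4066, 97, 3924, 3941),
    2017: (3578, 89, 3441, 3472),
    2018: (3513, 70, 2893, 3413),
}


def coverage(year: int) -> dict:
    """Return the share of reports in `year` that have each kind of data."""
    total, csr, financial, stock = STATS[year]
    return {
        "csr": csr / total,
        "financial": financial / total,
        "stock": stock / total,
    }


for year in sorted(STATS):
    shares = coverage(year)
    print(year, {name: f"{value:.1%}" for name, value in shares.items()})
```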

File structure

Raw file version (`--kind F`)

The dataset is structured as follows.

chakki_esg_financial_{year}.zip
└──{year}
     ├── documents.csv
     └── docs/

docs contains XBRL and PDF files:

- XBRL files of annual reports (retrieved from EDINET).
- PDF files of CSR reports (additional content).

documents.csv contains metadata such as the following. See the Wiki for details.

edinet_code: E0000X
filer_name: XXX株式会社
fiscal_year: 201X
fiscal_period: FY
doc_path: docs/S000000X.xbrl
csr_path: docs/E0000X_201X_JP_36.pdf
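
The metadata can be read with the standard library to pick out, for example, the filers that also have a CSR report. The sketch below rests on assumptions: that documents.csv has a header row with the column names listed above, and that the delimiter is passed in explicitly (check your copy of the file, it may be tab-separated). The helper names and the second sample row are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List, TextIO


def iter_documents(handle: TextIO, delimiter: str = ",") -> Iterator[Dict[str, str]]:
    """Yield one dict per row of documents.csv, keyed by the header names."""
    yield from csv.DictReader(handle, delimiter=delimiter)


def with_csr_report(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only rows whose csr_path column is non-empty."""
    return [row for row in rows if (row.get("csr_path") or "").strip()]


# In-memory stand-in for documents.csv; the second row is invented
# to show a filer without a CSR report.
sample = io.StringIO(
    "edinet_code,filer_name,fiscal_year,fiscal_period,doc_path,csr_path\n"
    "E0000X,XXX株式会社,201X,FY,docs/S000000X.xbrl,docs/E0000X_201X_JP_36.pdf\n"
    "E0000Y,YYY株式会社,201X,FY,docs/S000000Y.xbrl,\n"
)
rows = list(iter_documents(sample))
print([row["edinet_code"] for row in with_csr_report(rows)])
```

For a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `iter_documents`; the encoding is also an assumption.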

Text extracted version (`--kind E`)

The text extracted version contains txt files, one for each part of an annual report.
The extracted parts are defined in xbrr.

chakki_esg_financial_{year}_extracted.zip
└──{year}
     ├── documents.csv
     └── docs/

Tool

You can download the dataset with the command line tool.

pip install coarij

For usage, run the command with -- (the tool uses fire).

coarij --

Example command.

# Download the raw file version of the 2014 dataset.
coarij download --kind F --year 2014

# Extract the business.overview_of_result part for TIS Inc. (sec code=3626).
coarij extract business.overview_of_result --sec_code 3626

# Tokenize text by Janome (`janome` or `sudachi` is supported).
pip install janome
coarij tokenize --tokenizer janome

# Show tokenized result (words are separated by \t).
head -n 5 data/processed/2014/docs/S100552V_business_overview_of_result_tokenized.txt
1       【      業績    等      の      概要    】
(       1       )               業績
当      連結    会計    年度    における        我が国  経済    は      、     消費    税率    引上げ  に      伴う    駆け込み        需要    の      反動   や      海外    景気    動向    に対する        先行き  懸念    等      から   弱い    動き    も      見      られ    まし    た      が      、      企業   収益    の      改善    等      により  全体  ...
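
Since the tokenized files separate words with tabs, they can be loaded back with plain string splitting. The following is a minimal sketch under that assumption; the helper names are illustrative, and the sample lines are rebuilt from the first two lines of the output above with explicit tab characters.

```python
from collections import Counter
from typing import Iterable, List


def split_tokens(line: str) -> List[str]:
    """Split one tokenized line on tabs, dropping empty or blank tokens."""
    return [tok for tok in line.rstrip("\n").split("\t") if tok.strip()]


def token_counts(lines: Iterable[str]) -> Counter:
    """Count token frequencies over an iterable of tokenized lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(split_tokens(line))
    return counts


sample_lines = [
    "1\t【\t業績\t等\tの\t概要\t】\n",
    "(\t1\t)\t\t業績\n",
]
counts = token_counts(sample_lines)
print(counts["業績"])  # 2
```

To process a real file, pass an open file handle (for example one opened with `encoding="utf-8"`, which is an assumption about the output encoding) to `token_counts`.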

To download the latest dataset, specify --version master when downloading the data.

For the parts that can be parsed, see xbrr.

You can use Ledger to select the files you need from the overall CoARiJ dataset.

from coarij.storage import Storage


storage = Storage("your/data/directory")
ledger = storage.get_ledger()
collected = ledger.collect(edinet_code="E00021")

Stars: ✭ 55