indobenchmark / IndoNLU

License: MIT
The first large-scale natural language processing benchmark for Indonesian. We provide multiple downstream tasks, pre-trained IndoBERT models, and starter code. (AACL-IJCNLP 2020)



IndoNLU


Read this README in Bahasa Indonesia.

IndoNLU is a collection of Natural Language Understanding (NLU) resources for Bahasa Indonesia with 12 downstream tasks. We provide code to reproduce the results, as well as large pre-trained models (IndoBERT and IndoBERT-lite) trained on a corpus of around 4 billion words (Indo4B), more than 20 GB of text data. This project started as a joint collaboration between universities and industry partners, including Institut Teknologi Bandung, Universitas Multimedia Nusantara, The Hong Kong University of Science and Technology, Universitas Indonesia, Gojek, and Prosa.AI.
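As a quick sketch of getting started, the pre-trained checkpoints can be loaded with the HuggingFace `transformers` library, assuming they are published on the Hub under the `indobenchmark` organization (the identifier `indobenchmark/indobert-base-p1` below is one example; check the model hub for the full list):

```python
from transformers import AutoModel, AutoTokenizer

# Example checkpoint name; other IndoBERT variants follow the same pattern.
checkpoint = "indobenchmark/indobert-base-p1"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

# Encode an Indonesian sentence and extract contextual embeddings.
inputs = tokenizer("Selamat pagi, apa kabar?", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```

The `last_hidden_state` tensor can then be pooled or fed to a task-specific head for any of the downstream tasks.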

Research Paper

IndoNLU has been accepted at AACL-IJCNLP 2020; you can find the details in our paper: https://www.aclweb.org/anthology/2020.aacl-main.85.pdf. If you use any component of IndoNLU, including Indo4B, FastText-Indo4B, or IndoBERT, in your work, please cite the following paper:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}

How to contribute to IndoNLU?

Be sure to check the contributing guidelines and contact the maintainers, or open an issue to collect feedback, before starting your PR.

12 Downstream Tasks

  • You can check [Link]
  • We provide train, validation, and test sets. The labels of the test set are masked (no true labels) to preserve the integrity of the evaluation. Please submit your predictions to the submission portal at CodaLab.

Examples

  • A guide to loading the IndoBERT model and fine-tuning it on sequence classification and sequence tagging tasks.
  • You can check the link
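The fine-tuning setup can be sketched as a standard PyTorch training loop. This is a minimal stand-in that uses random features in place of IndoBERT's encoder output so it runs without downloading a model; the shapes, hyperparameters, and names are illustrative, not the repo's actual API:

```python
import torch
from torch import nn

# Stand-in for a classification head on top of IndoBERT's pooled output.
# In the real setup, `features` would come from the pre-trained encoder.
hidden_size, num_classes = 768, 3
classifier = nn.Linear(hidden_size, num_classes)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(8, hidden_size)          # a batch of 8 "sentence embeddings"
labels = torch.randint(0, num_classes, (8,))    # dummy gold labels

for step in range(3):                           # a few gradient steps
    optimizer.zero_grad()
    logits = classifier(features)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()

predictions = logits.argmax(dim=-1)             # one predicted class per sample
```

For sequence tagging, the same loop applies with per-token logits and a label per token instead of a label per sentence.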

Submission Format

Please kindly check the link. Each task has a different format. Every submission file starts with an index column (the id of each test sample, following the order of the masked test set).

To submit, first rename your prediction file to pred.txt, then zip it. After that, allow the system some time to compute the results. You can check the progress in your Results tab.
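The packaging step can be sketched in a few lines of Python. The tab-separated `index`/`label` layout below is an assumption for illustration; check the linked per-task format before submitting:

```python
import zipfile

# Hypothetical predictions: one label per test sample, in the order of the
# masked test set. The exact per-task format is described at the link above.
predictions = ["positive", "negative", "neutral"]

# The portal expects the prediction file to be named pred.txt.
with open("pred.txt", "w", encoding="utf-8") as f:
    f.write("index\tlabel\n")
    for i, label in enumerate(predictions):
        f.write(f"{i}\t{label}\n")

# Zip the prediction file for upload.
with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("pred.txt")
```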

Indo4B Dataset

We provide access to our large pretraining dataset. In this version, we exclude all Twitter tweets due to restrictions in the Twitter Developer Policy and Agreement.

  • Indo4B Dataset (23 GB uncompressed, 5.6 GB compressed) [Link]

IndoBERT and IndoBERT-lite Models

We provide 4 IndoBERT and 4 IndoBERT-lite pre-trained language models [Link]

FastText (Indo4B)

We provide the full uncased FastText model file (11.9 GB) and the corresponding vector file (3.9 GB):

  • FastText model (11.9 GB) [Link]
  • Vector file (3.9 GB) [Link]

We also provide smaller FastText models with a reduced vocabulary for each of the 12 downstream tasks.
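As a sketch of how the vector file can be consumed: fastText's `.vec` text format starts with a header line giving the vocabulary size and dimension, followed by one word and its vector per line. The snippet below writes a tiny dummy file in that format (in place of the 3.9 GB download) and reads it back:

```python
import numpy as np

# A tiny dummy file in the standard fastText .vec text layout; the real
# Indo4B vector file follows the same format at a much larger scale.
with open("dummy.vec", "w", encoding="utf-8") as f:
    f.write("2 4\n")
    f.write("makan 0.1 0.2 0.3 0.4\n")
    f.write("minum 0.1 0.2 0.3 0.5\n")

vectors = {}
with open("dummy.vec", encoding="utf-8") as f:
    vocab_size, dim = map(int, f.readline().split())  # header: "<size> <dim>"
    for line in f:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)

def cosine(a, b):
    """Cosine similarity between two word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

similarity = cosine(vectors["makan"], vectors["minum"])
```

For the full-size file, a library such as gensim's `KeyedVectors` can load the same format more efficiently.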

Leaderboard
