indobenchmark / IndoNLU

License: MIT
The first large-scale natural language processing benchmark for Indonesian. We provide multiple downstream tasks, pre-trained IndoBERT models, and starter code. (AACL-IJCNLP 2020)



IndoNLU


Read this README in Bahasa Indonesia.

IndoNLU is a collection of Natural Language Understanding (NLU) resources for Bahasa Indonesia with 12 downstream tasks. We provide code to reproduce the results, as well as large pre-trained models (IndoBERT and IndoBERT-lite) trained on a corpus of around 4 billion words (Indo4B), more than 20 GB of text data. This project started as a joint collaboration between universities and industry partners, including Institut Teknologi Bandung, Universitas Multimedia Nusantara, The Hong Kong University of Science and Technology, Universitas Indonesia, Gojek, and Prosa.AI.
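As a quick sketch of getting started, the pre-trained checkpoints can be loaded with the HuggingFace `transformers` library, assuming they are published on the Hub under the `indobenchmark` organization (the identifier `indobenchmark/indobert-base-p1` below is one example; check the model hub for the full list):

```python
from transformers import AutoModel, AutoTokenizer

# Example checkpoint name; other IndoBERT variants follow the same pattern.
checkpoint = "indobenchmark/indobert-base-p1"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

# Encode an Indonesian sentence and extract contextual embeddings.
inputs = tokenizer("Selamat pagi, apa kabar?", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```

The `last_hidden_state` tensor can then be pooled or fed to a task-specific head for any of the downstream tasks.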

Research Paper

IndoNLU has been accepted at AACL-IJCNLP 2020; you can find the details in our paper: https://www.aclweb.org/anthology/2020.aacl-main.85.pdf. If you use any component of IndoNLU, including Indo4B, FastText-Indo4B, or IndoBERT, in your work, please cite the following paper:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}

How to contribute to IndoNLU?

Be sure to check the contributing guidelines and contact the maintainers, or open an issue to collect feedback, before starting your PR.

12 Downstream Tasks

  • You can check [Link]
  • We provide train, validation, and test sets. The labels of the test set are masked (no true labels) to preserve the integrity of the evaluation. Please submit your predictions to the submission portal at CodaLab.

Examples

  • A guide to loading the IndoBERT model and fine-tuning it on sequence classification and sequence tagging tasks.
  • You can check the link
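The fine-tuning setup can be sketched as a standard PyTorch training loop. This is a minimal stand-in that uses random features in place of IndoBERT's encoder output so it runs without downloading a model; the shapes, hyperparameters, and names are illustrative, not the repo's actual API:

```python
import torch
from torch import nn

# Stand-in for a classification head on top of IndoBERT's pooled output.
# In the real setup, `features` would come from the pre-trained encoder.
hidden_size, num_classes = 768, 3
classifier = nn.Linear(hidden_size, num_classes)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(8, hidden_size)          # a batch of 8 "sentence embeddings"
labels = torch.randint(0, num_classes, (8,))    # dummy gold labels

for step in range(3):                           # a few gradient steps
    optimizer.zero_grad()
    logits = classifier(features)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()

predictions = logits.argmax(dim=-1)             # one predicted class per sample
```

For sequence tagging, the same loop applies with per-token logits and a label per token instead of a label per sentence.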

Submission Format

Please kindly check the link. Each task has a different format. Every submission file starts with an index column (the id of each test sample, following the order of the masked test set).

To submit, first rename your prediction file to pred.txt, then zip it. After that, allow the system some time to compute the results. You can check the progress in your Results tab.
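The packaging step can be sketched in a few lines of Python. The tab-separated `index`/`label` layout below is an assumption for illustration; check the linked per-task format before submitting:

```python
import zipfile

# Hypothetical predictions: one label per test sample, in the order of the
# masked test set. The exact per-task format is described at the link above.
predictions = ["positive", "negative", "neutral"]

# The portal expects the prediction file to be named pred.txt.
with open("pred.txt", "w", encoding="utf-8") as f:
    f.write("index\tlabel\n")
    for i, label in enumerate(predictions):
        f.write(f"{i}\t{label}\n")

# Zip the prediction file for upload.
with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("pred.txt")
```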

Indo4B Dataset

We provide access to our large pretraining dataset. In this version, we exclude all Twitter tweets due to restrictions in the Twitter Developer Policy and Agreement.

  • Indo4B Dataset (23 GB uncompressed, 5.6 GB compressed) [Link]

IndoBERT and IndoBERT-lite Models

We provide 4 IndoBERT and 4 IndoBERT-lite pre-trained language models [Link]

FastText (Indo4B)

We provide the full uncased FastText model file (11.9 GB) and the corresponding vector file (3.9 GB):

  • FastText model (11.9 GB) [Link]
  • Vector file (3.9 GB) [Link]

We also provide smaller FastText models with a reduced vocabulary for each of the 12 downstream tasks.
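As a sketch of how the vector file can be consumed: fastText's `.vec` text format starts with a header line giving the vocabulary size and dimension, followed by one word and its vector per line. The snippet below writes a tiny dummy file in that format (in place of the 3.9 GB download) and reads it back:

```python
import numpy as np

# A tiny dummy file in the standard fastText .vec text layout; the real
# Indo4B vector file follows the same format at a much larger scale.
with open("dummy.vec", "w", encoding="utf-8") as f:
    f.write("2 4\n")
    f.write("makan 0.1 0.2 0.3 0.4\n")
    f.write("minum 0.1 0.2 0.3 0.5\n")

vectors = {}
with open("dummy.vec", encoding="utf-8") as f:
    vocab_size, dim = map(int, f.readline().split())  # header: "<size> <dim>"
    for line in f:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)

def cosine(a, b):
    """Cosine similarity between two word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

similarity = cosine(vectors["makan"], vectors["minum"])
```

For the full-size file, a library such as gensim's `KeyedVectors` can load the same format more efficiently.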

Leaderboard
