kmkurn / Id Nlp Resource

A list of Indonesian NLP resources.

Indonesian NLP resources

Language modeling

  1. Kompas online collection. This corpus contains Kompas online news articles from 2001-2002. See here for more info and citations.
  2. Tempo online collection. This corpus contains Tempo online news articles from 2000-2002. See here for more info and citations.
  3. OSCAR. This large corpus contains articles from many sources crawled by CommonCrawl and extracted by ALMAnaCH. In total there are 4B word tokens and 2B word types. (NOTE: Contains strong language, mostly coming from gambling sites.)
  4. Leipzig corpora collection. Indonesian mixed corpus based on material from 2013. Sentences: 74,329,815 - Types: 7,964,109 - Tokens: 1,206,281,985. From news materials, randomly chosen websites, and Wikipedia dumps.
  5. CC-100. This large corpus contains articles from many sources crawled by CommonCrawl and extracted by FAIR. For Bahasa Indonesia, in total there are around 4.8B sentences and 6B sentence piece tokens. See here for more info and citations.
  6. IndoNLU Benchmark. A collective effort by researchers and practitioners from Gojek, Institut Teknologi Bandung, HKUST, Universitas Multimedia Nusantara, Prosa.ai, and Universitas Indonesia. They provide pre-trained BERT/ALBERT language models trained on a large corpus of 4B words (250M sentences). They also provide single-sentence and sentence-pair datasets for evaluating classification and sequence-tagging tasks.
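
When comparing corpora for language modeling, two quick statistics are the type-token ratio and the average sentence length. The sketch below computes both from the Leipzig figures quoted in item 4 (sentences, types, tokens). The function and variable names are illustrative and do not come from any of the listed projects.

```python
def type_token_ratio(num_types: int, num_tokens: int) -> float:
    """Return the share of distinct word types among all word tokens."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return num_types / num_tokens


# Figures quoted above for the Leipzig Indonesian mixed corpus (2013).
leipzig_sentences = 74_329_815
leipzig_types = 7_964_109
leipzig_tokens = 1_206_281_985

ttr = type_token_ratio(leipzig_types, leipzig_tokens)
avg_sentence_length = leipzig_tokens / leipzig_sentences

print(f"type-token ratio: {ttr:.4f}")
print(f"average tokens per sentence: {avg_sentence_length:.1f}")
```

A low type-token ratio is expected for a corpus of this size, since the vocabulary grows much more slowly than the token count.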

POS tagging

  1. PANL10N POS tagging. This corpus has 39K sentences and 900K word tokens.
  2. IDN tagged corpus. This corpus contains 10K sentences and 250K word tokens. The POS tags are annotated manually.

Sentiment analysis

  1. Aspect and Opinion Terms Extraction for Hotel Reviews. The corpus consists of 5000 hotel reviews from Airy (78K tokens) with 5 labels. The paper is available on arXiv.
  2. Aspect-Based Sentiment Analysis. A text classification resource for multi-label aspect categorization.

Syntactic parsing

  1. Indonesian Treebank. This corpus contains 1K parsed sentences. (constituency parsing)
  2. UD Indonesian. This corpus is provided by Universal Dependencies. Training, development, and test splits are already provided. (dependency parsing)

Machine translation

  1. PANL10N EN-ID news parallel corpus. This corpus has sentences from news articles from several categories: economy (6K sentences), international (6K sentences), science (6K sentences), and sport (4K sentences).
  2. PANL10N Indonesian translation of Penn treebank. This corpus contains an Indonesian translation of the Penn treebank. In total there are 24K sentences.
  3. OPUS (Open Parallel Corpus). This site contains parallel corpora of Indonesian and other languages based on openly available resources (e.g., OpenSubtitles).
  4. IDENTICv1.0 [paper]. Indonesian (ID)-English (EN). 45k sentences/~1M tokens (ID). Domains: science, sport, international, economy, news articles, movie subtitles. It may overlap with the PANL10N corpus. The dataset is available as raw sentences, tokenized sentences, and in CoNLL format.
  5. IWSLT2017 [paper]. ID-EN. ~100K sentences. TED talk subtitles (spoken language). NOTE: the provided test set tst2017-plus contains a small part of the training data (as mentioned here).
  6. Asian Language Treebank [paper]. ID, EN, and some Asian languages (mostly South East Asian). 20K sentences. Domain: News.

Word normalization

  1. Colloquial Indonesian Lexicon. This lexicon consists of 3592 unique colloquial tokens that are mapped onto 1742 unique lemmas. The full description of this lexicon can be seen in the paper.

Text summarization

  1. IndoSum. A collection of 20K online news article-summary pairs belonging to 6 categories and 10 sources. It has both abstractive summaries and extractive labels.

Text classification

  1. SMS Spam. This corpus contains 1143 sentences labeled as normal message, fraud, or promotion. It is provided by Yudi Wibisono.
  2. Hate Speech Detection. This dataset consists of 713 tweets in the Indonesian language: 453 non-hate-speech tweets and 260 hate speech tweets.
  3. Abusive Language Detection. A collection of tweets for abusive language detection in Indonesian social media. It comes with two labeling schemes: abusive/not abusive, and not abusive/abusive but not offensive/offensive. It also has its own colloquial Indonesian lexicon.
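
Small classification datasets are often imbalanced, so it helps to know the majority-class baseline before reading accuracy numbers. The sketch below works this out for the Hate Speech Detection dataset using only the counts quoted in item 2. The label names are illustrative and may not match the dataset's actual label strings.

```python
# Class counts quoted above for the Hate Speech Detection dataset.
# The label names are assumptions made for this example.
counts = {"non_hate_speech": 453, "hate_speech": 260}

total = sum(counts.values())  # should match the 713 tweets reported
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the most frequent label
# reaches this accuracy without learning anything.
majority_label = max(counts, key=counts.get)
majority_baseline_accuracy = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Any model evaluated on this dataset should be compared against that baseline, and metrics such as per-class F1 are usually more informative than raw accuracy here.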

Speech recognition

  1. TITML-IDN speech corpus. The corpus contains 20 speakers (11 male and 9 female), each of whom speaks 343 utterances. The utterances are phonetically balanced. The corpus is free to use for academic/non-commercial purposes, but interested parties should make a formal request via email to the institution. The procedure is listed here.
  2. Indonesian Speech Recognition. A small corpus of 50 utterances by a single male speaker. Disclaimer: This is a school project; do not use it for any important tasks. The author is not responsible for any undesired results of using the data provided here.
  3. CMU Wilderness Multilingual Speech Dataset. A dataset of over 700 different languages providing audio, aligned texts, and word pronunciations. One of the languages is Indonesian. The utterances are Bible readings recorded by bible.is.

Paraphrase identification

  1. Translated PAWS. This dataset is a translation of PAWS produced with Google Translate. It contains 100K human-labeled examples that highlight the importance of modeling structure, context, and word order for paraphrase identification.