kirralabs / Indonesian Nlp Resources

Licence: MIT
Data resources for Indonesian NLP

Projects that are alternatives of or similar to Indonesian Nlp Resources

Nlp bahasa resources
A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia
Stars: ✭ 158 (+10.49%)
Mutual labels:  dataset, corpus, sentiment-analysis
Awesome Hungarian Nlp
A curated list of NLP resources for Hungarian
Stars: ✭ 121 (-15.38%)
Mutual labels:  dataset, corpus, named-entity-recognition
Universal Data Tool
Collaborate & label any type of data, images, text, or documents, in an easy web interface or desktop app.
Stars: ✭ 1,356 (+848.25%)
Mutual labels:  dataset, named-entity-recognition
Pynlp
A pythonic wrapper for Stanford CoreNLP.
Stars: ✭ 103 (-27.97%)
Mutual labels:  sentiment-analysis, named-entity-recognition
Dan Jurafsky Chris Manning Nlp
My solution to the Natural Language Processing course made by Dan Jurafsky, Chris Manning in Winter 2012.
Stars: ✭ 124 (-13.29%)
Mutual labels:  sentiment-analysis, named-entity-recognition
Turkish Bert Nlp Pipeline
Bert-base NLP pipeline for Turkish, Ner, Sentiment Analysis, Question Answering etc.
Stars: ✭ 85 (-40.56%)
Mutual labels:  sentiment-analysis, named-entity-recognition
Pytreebank
😡😇 Stanford Sentiment Treebank loader in Python
Stars: ✭ 93 (-34.97%)
Mutual labels:  dataset, sentiment-analysis
Ua Gec
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Stars: ✭ 108 (-24.48%)
Mutual labels:  dataset, corpus
Coarij
Corpus of Annual Reports in Japan
Stars: ✭ 55 (-61.54%)
Mutual labels:  dataset, corpus
Clue
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+1595.8%)
Mutual labels:  dataset, corpus
Dialog corpus
Corpus for training Chinese and English dialogue systems (Datasets for Training Chatbot System)
Stars: ✭ 1,662 (+1062.24%)
Mutual labels:  dataset, corpus
Cluedatasetsearch
Search all Chinese NLP datasets, with commonly used English NLP datasets included
Stars: ✭ 2,112 (+1376.92%)
Mutual labels:  corpus, sentiment-analysis
Dataset List
lists of text corpus and more (mainly Japanese)
Stars: ✭ 84 (-41.26%)
Mutual labels:  dataset, corpus
Wikipedia ner
📖 Labeled examples from wiki dumps in Python
Stars: ✭ 61 (-57.34%)
Mutual labels:  dataset, named-entity-recognition
Bond
BOND: BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision
Stars: ✭ 96 (-32.87%)
Mutual labels:  dataset, named-entity-recognition
Phonlp
PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing (NAACL 2021)
Stars: ✭ 56 (-60.84%)
Mutual labels:  named-entity-recognition, pos-tagging
Prosody
Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text
Stars: ✭ 139 (-2.8%)
Mutual labels:  dataset, corpus
French Sentiment Analysis Dataset
A collection of over 1.5 Million tweets data translated to French, with their sentiment.
Stars: ✭ 35 (-75.52%)
Mutual labels:  dataset, sentiment-analysis
Images Web Crawler
This package is a complete tool for creating a large dataset of images (specially designed, but not only, for machine learning enthusiasts). It can crawl the web, download images, rename / resize / convert the images and merge folders.
Stars: ✭ 51 (-64.34%)
Mutual labels:  dataset, crawler
Camel tools
A suite of Arabic natural language processing tools developed by the CAMeL Lab at New York University Abu Dhabi.
Stars: ✭ 124 (-13.29%)
Mutual labels:  sentiment-analysis, named-entity-recognition

indonesian-NLP-resources

NLP data for Bahasa Indonesia (last updated 20 Sep 2020)

Sentences Dataset

  1. leipzig Indonesian sentence collection: news articles, web articles, Wikipedia data from 2008-2016
  2. wn-msa.sourceforge.net Wordnet Bahasa
  3. Quran Indonesian Quran translations (id.muntakhab, id.jalalayn, id.indonesian)
  4. Kompas online collection. This corpus contains Kompas online news articles from 2001-2002. See here for more info and citations.
  5. Tempo online collection. This corpus contains Tempo online news articles from 2000-2002. See here for more info and citations.
  6. corpus-frog-storytelling spoken-text storytelling corpus
  7. TED-Multilingual-Parallel-Corpus Monolingual_data/Indonesian
  8. Opus Opus NLPL
  9. Sealang Sealang dataset

Word reference (kemdikbud) link

  1. Base entries (Entri Dasar): 50,668 (45.02%)
  2. Derived words (Kata Turunan): 26,835 (23.85%)
  3. Compound words (Gabungan Kata): 31,492 (27.98%)
  4. Proverbs (Peribahasa): 2,054 (1.83%)
  5. Figurative expressions (Kiasan): 269 (0.24%)
  6. Idioms (Ungkapan): 1,131 (1.00%)
  7. Variants (Varian): 89 (0.08%)
  8. Total entries (Entri Total): 112,538 (100.00%)
  9. Total senses (Makna Total): 131,533
  10. Total examples (Contoh Total): 30,010
  11. Total categories (Kategori Total): 234
  12. Senses per entry (Makna Per Entri): 1.169
  13. Examples per sense (Contoh Per Makna): 0.228
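
The figures above are internally consistent, which can be checked with a few lines of Python. This is only a sketch: the counts are copied from the list, with the Indonesian thousands separators removed, and the variable names are illustrative rather than part of any dataset.

```python
# Entry counts by type, copied from the kemdikbud statistics listed above.
entry_counts = {
    "Entri Dasar": 50668,    # base entries
    "Kata Turunan": 26835,   # derived words
    "Gabungan Kata": 31492,  # compound words
    "Peribahasa": 2054,      # proverbs
    "Kiasan": 269,           # figurative expressions
    "Ungkapan": 1131,        # idioms
    "Varian": 89,            # variants
}
total_senses = 131533    # Makna Total
total_examples = 30010   # Contoh Total

# The per-type counts should add up to the reported total (112538).
total_entries = sum(entry_counts.values())

# Share of each entry type, as a percentage rounded to two decimals.
shares = {name: round(100 * n / total_entries, 2) for name, n in entry_counts.items()}

# Ratios reported at the end of the list.
senses_per_entry = round(total_senses / total_entries, 3)
examples_per_sense = round(total_examples / total_senses, 3)

print(total_entries, shares, senses_per_entry, examples_per_sense)
```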

Words dataset (PUEBI word type)

  1. Word class => nouns (18647), verbs (39070) = 57717 words
  2. Word type => root words (41409), derived words (24913), compound words, figures of speech, proverbs, expressions = 66322 words
  3. Word root => source#1.1 : sastrawi 29932 words ; source#1.2 : sastrawi 30342 words ; source#2 : SentiStrengthID 27979 words ; source#3 : serangkai 30342 words
  4. Word spaCy : id
  5. word : serangkai
  6. Word name : random-name
  7. Word Indo name : genderprediction
  8. Word Wiktionary : word id
  9. word compound =>
  10. Word Acronyms =>
  11. Word Negative =>
  12. Word Positive =>
  13. Word Slang =>
  14. Stopwords =>
  15. Emoticon =>
  16. Named Entity =>
    • source#1 : [Place] country
    • source#1 : [Place] Wilayah-Administratif-Indonesia (provinces, villages, districts, regencies)
    • source#2 : [Place] Indonesia-Postal-Code (provinces, cities, subdistricts, urbans)
    • source#3 : [Place] indonesian-region
    • source#3 : [Person] gender prediction
    • source#4 : [Person] random name
    • source#5 : [Person] title of name
    • source#6 : [Person] degree
    • source#7 : [Org] institution

Tagged dataset

  1. NER =>
    • source#1 : yohanesgultom/nlp-experiments 1700 sentences
    • source#2 : yusufsyaifudin/indonesia-ner 1835 sentences
  2. POS-TAG
    • POS-TAG : famrashel/idn-tagged-corpus
    • POS-TAG : pebbie/pebahasa ~600 sentences
    • POS-TAG Parser : UniversalDependencies/UD_Indonesian-GSD ~4477 sentences
  3. Sentiment =>
    • source#1 : 1506 sentences ;
    • source#2 : Sentiment words with strength, agusmakmun/SentiStrengthID 1573 (range: -5 to 5) ;
    • source#3 : Sentiment words with weights, fajri91/InSet -> separate word lists weighted by strength (range: -5 to 5). 6610 negative words and 3619 positive words
  4. panl10n Pan Localization
  5. Acronyms : ramaprakoso/analisis-sentimen 4085 words
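
The weighted sentiment lexicons listed above assign each word a strength between -5 and 5. A minimal sketch of how such a lexicon might be used to score a sentence follows. The four words and their weights are invented for illustration and are not taken from SentiStrengthID or InSet, the function names are hypothetical, and summing the weights is a simplification, not the scoring method of either project.

```python
import re

# Hypothetical mini-lexicon: word -> strength in the range -5..5.
# These weights are made up for the example, not real lexicon values.
toy_lexicon = {"bagus": 3, "senang": 4, "buruk": -3, "kecewa": -4}


def score_sentence(text, lexicon):
    """Return (total score, matched (word, weight) pairs) for a sentence."""
    # Lowercase and split on non-word characters; no stemming is applied.
    tokens = re.findall(r"\w+", text.lower())
    matched = [(tok, lexicon[tok]) for tok in tokens if tok in lexicon]
    total = sum(weight for _, weight in matched)
    return total, matched


def label(total):
    """Map a summed score to a coarse polarity label."""
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


total, matched = score_sentence("Filmnya bagus tapi saya kecewa", toy_lexicon)
print(total, matched, label(total))
```

In practice, inflected forms such as "mengecewakan" would not match without stemming, so a root-word resource such as the ones listed under the words dataset would normally be applied before the lookup.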

Parallel corpus Eng-Ind

  1. parallel-corpora-en-id
  2. Indonesian-English-Bilingual-Corpus
  3. TALPCo
  4. opus
  5. Multi-Wiki

Sentence Analyzer

  1. MALINDO_Morph
  2. morphind
  3. INDRA
  4. pujangga : An interface for InaNLP and Deeplearning4j's Word2Vec for Indonesian (Bahasa Indonesia) in the form of REST API.
  5. id-multi-label-hate-speech-and-abusive-language-detection : Here we provide our dataset for multi-label hate speech and abusive language detection in the Indonesian Twitter.
  6. kawat : A Word Analogy Task Dataset for Indonesian

Crawler Data

  1. Crawler Indonesian news portal