kirralabs / Indonesian Nlp Resources

Licence: mit

Data resources for Indonesian-language NLP

Labels

nlp dataset crawler sentiment-analysis named-entity-recognition corpus pos-tagging

Projects that are alternatives of or similar to Indonesian Nlp Resources

Nlp bahasa resources

A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia

Stars: ✭ 158 (+10.49%)

Mutual labels: dataset, corpus, sentiment-analysis

Awesome Hungarian Nlp

A curated list of NLP resources for Hungarian

Stars: ✭ 121 (-15.38%)

Mutual labels: dataset, corpus, named-entity-recognition

Universal Data Tool

Collaborate & label any type of data, images, text, or documents, in an easy web interface or desktop app.

Stars: ✭ 1,356 (+848.25%)

Mutual labels: dataset, named-entity-recognition

Pynlp

A pythonic wrapper for Stanford CoreNLP.

Stars: ✭ 103 (-27.97%)

Mutual labels: sentiment-analysis, named-entity-recognition

Dan Jurafsky Chris Manning Nlp

My solution to the Natural Language Processing course made by Dan Jurafsky, Chris Manning in Winter 2012.

Stars: ✭ 124 (-13.29%)

Mutual labels: sentiment-analysis, named-entity-recognition

Turkish Bert Nlp Pipeline

Bert-base NLP pipeline for Turkish, Ner, Sentiment Analysis, Question Answering etc.

Stars: ✭ 85 (-40.56%)

Mutual labels: sentiment-analysis, named-entity-recognition

Pytreebank

😡😇 Stanford Sentiment Treebank loader in Python

Stars: ✭ 93 (-34.97%)

Mutual labels: dataset, sentiment-analysis

Ua Gec

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

Stars: ✭ 108 (-24.48%)

Mutual labels: dataset, corpus

Coarij

Corpus of Annual Reports in Japan

Stars: ✭ 55 (-61.54%)

Mutual labels: dataset, corpus

Clue

Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard

Stars: ✭ 2,425 (+1595.8%)

Mutual labels: dataset, corpus

Dialog corpus

A corpus for training Chinese and English dialogue systems (Datasets for Training Chatbot System)

Stars: ✭ 1,662 (+1062.24%)

Mutual labels: dataset, corpus

Cluedatasetsearch

Search all Chinese NLP datasets, plus commonly used English NLP datasets

Stars: ✭ 2,112 (+1376.92%)

Mutual labels: corpus, sentiment-analysis

Dataset List

lists of text corpus and more (mainly Japanese)

Stars: ✭ 84 (-41.26%)

Mutual labels: dataset, corpus

Wikipedia ner

📖 Labeled examples from wiki dumps in Python

Stars: ✭ 61 (-57.34%)

Mutual labels: dataset, named-entity-recognition

Bond

BOND: BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision

Stars: ✭ 96 (-32.87%)

Mutual labels: dataset, named-entity-recognition

Phonlp

PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing (NAACL 2021)

Stars: ✭ 56 (-60.84%)

Mutual labels: named-entity-recognition, pos-tagging

Prosody

Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text

Stars: ✭ 139 (-2.8%)

Mutual labels: dataset, corpus

French Sentiment Analysis Dataset

A collection of over 1.5 Million tweets data translated to French, with their sentiment.

Stars: ✭ 35 (-75.52%)

Mutual labels: dataset, sentiment-analysis

Images Web Crawler

This package is a complete tool for creating a large dataset of images (designed especially, but not only, for machine learning enthusiasts). It can crawl the web, download images, rename / resize / convert the images and merge folders.

Stars: ✭ 51 (-64.34%)

Mutual labels: dataset, crawler

Camel tools

A suite of Arabic natural language processing tools developed by the CAMeL Lab at New York University Abu Dhabi.

Stars: ✭ 124 (-13.29%)

Mutual labels: sentiment-analysis, named-entity-recognition

indonesian-NLP-resources

NLP data for Bahasa Indonesia (last updated 20 Sep 2020)

Sentences Dataset

Leipzig Indonesian sentence collection: news articles, web articles, and Wikipedia data from 2008-2016
wn-msa.sourceforge.net: Wordnet Bahasa
Quran: Indonesian Quran translations (id.muntakhab, id.jalalayn, id.indonesian)
Kompas online collection: this corpus contains Kompas online news articles from 2001-2002. See here for more info and citations.
Tempo online collection: this corpus contains Tempo online news articles from 2000-2002. See here for more info and citations.
corpus-frog-storytelling: spoken-text storytelling
TED-Multilingual-Parallel-Corpus: Monolingual_data/Indonesian
Opus: Opus NLPL
Sealang: Sealang dataset

Words dataset (PUEBI word type)

Word class => nouns (18647), verbs (39070) = 57717 words
Word type => root words (41409), derivative words (24913), compound words, figures of speech, proverbs, expressions = 66322 words
Word root => source#1.1 : sastrawi 29932 words ; source#1.2 : sastrawi 30342 words ; source#2 : SentiStrengthID 27979 words ; source#3 : serangkai 30342 words
Word spaCy : id
word : serangkai
Word name : random-name
Word Indo name : genderprediction
Word Wiktionary : word id
word compound =>
- source#1 : 71 words
- source#2 : puebi
Word Acronyms =>
- source#1 : 4085 words ;
- source#2 : 70 words
Word Negative =>
- source#1.1 : 3829 words ; source#1.2 : 3523 words ; source#1.3 : 154 words ;
- source#2 : ID-OpinionWords 2402 words
- source#3 : 3523 words
- source#4 : 126 words
Word Positive =>
- source#1.1 : 1678 words ; source#1.2 : 40 words ; source#1.3 : 1293 words ;
- source#2 : 1182 words
- source#3 : 1293 words
Word Slang =>
- source#1 : 1319 words ;
- source#2 : 286 words ;
- source#3 : 1147 words
- source#4 : 62 words
- source#4 : 15167 words
Stopwords =>
- source#1 : spacy data ;
- source#2 : 759 words ;
- source#3 : 399 words ;
- source#4 : 759+329+124+126 words
Emoticon =>
- source#1 : 252 ;
- source#2 : 3018 ;
- source#3 : 123
Named Entity =>
- source#1 : [Place] country
- source#1 : [Place] Wilayah-Administratif-Indonesia (provinces, villages, districts, regencies)
- source#2 : [Place] Indonesia-Postal-Code (provinces, cities, subdistricts, urbans)
- source#3 : [Place] indonesian-region
- source#3 : [Person] gender prediction
- source#4 : [Person] random name
- source#5 : [Person] title of name
- source#6 : [Person] degree
- source#7 : [Org] institution
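
Some entries above, such as the fourth stopword source (listed as 759+329+124+126 words), appear to combine several word lists. A minimal sketch of such a merge is shown here, assuming each source is a plain list of words; the function name `merge_word_lists` and the sample words are hypothetical and not taken from the repository. Because overlapping entries are deduplicated, the merged count can be smaller than the sum of the individual counts.

```python
def merge_word_lists(*lists):
    """Merge word lists into one sorted list of unique, lowercased words."""
    merged = set()
    for words in lists:
        # Strip whitespace, lowercase, and skip empty entries.
        merged.update(w.strip().lower() for w in words if w.strip())
    return sorted(merged)


# Hypothetical sample lists (illustration only, not the repository's data).
list_a = ["yang", "dan", "di"]
list_b = ["Dan", "ke", "dari "]

merged = merge_word_lists(list_a, list_b)
print(len(merged), merged)
```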

Tagged dataset

NER =>
- source#1 : yohanesgultom/nlp-experiments 1700 sentences
- source#2 : yusufsyaifudin/indonesia-ner 1835 sentences
POS-TAG =>
- POS-TAG : famrashel/idn-tagged-corpus
- POS-TAG : pebbie/pebahasa ~600 sentences
- POS-TAG Parser : UniversalDependencies/UD_Indonesian-GSD ~4477 sentences
Sentiment =>
- source#1 : 1506 sentences ;
- source#2 : Sentiment words with strength agusmakmun/SentiStrengthID 1573 (range : -5 to 5) ;
- source#3 : Sentiment with weight fajri91/InSet -> separate word lists with a strength weight per word (range : -5 to 5). 6610 negative words and 3619 positive words
panl10n Pan Localization
Acronyms : ramaprakoso/analisis-sentimen 4085 words
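
The weighted sentiment lexicons listed above assign each word a strength from -5 to 5. One simple way to use such a list is to sum the weights of the words found in a sentence. The sketch below assumes a tab-separated `word<TAB>weight` layout, which may differ from the actual files; the function names and the sample words and weights are made up for illustration and are not taken from InSet or SentiStrengthID.

```python
import csv
import io


def load_lexicon(tsv_text):
    """Parse 'word<TAB>weight' lines into a dict of word -> integer weight."""
    lexicon = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        word, weight = row[0].strip().lower(), int(row[1])
        if not -5 <= weight <= 5:
            raise ValueError(f"weight out of range for {word!r}: {weight}")
        lexicon[word] = weight
    return lexicon


def score_sentence(sentence, lexicon):
    """Sum the weights of whitespace-separated tokens; unknown words count as 0."""
    return sum(lexicon.get(token, 0) for token in sentence.lower().split())


# Hypothetical sample lexicon (assumed format, invented weights).
SAMPLE = "bagus\t3\nburuk\t-4\nsenang\t4\n"
lexicon = load_lexicon(SAMPLE)
print(score_sentence("filmnya bagus tapi akhirnya buruk", lexicon))
```

Whitespace tokenization is used only to keep the sketch short; real text would likely need punctuation handling, slang normalization, and stemming first, for which the word lists above could serve.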

Parallel corpus Eng-Ind

Sentence Analyzer

MALINDO_Morph
morphind
INDRA
pujangga : An interface for InaNLP and Deeplearning4j's Word2Vec for Indonesian (Bahasa Indonesia) in the form of REST API.
id-multi-label-hate-speech-and-abusive-language-detection : Here we provide our dataset for multi-label hate speech and abusive language detection in the Indonesian Twitter.
kawat : A Word Analogy Task Dataset for Indonesian

Crawler Data

Crawler Indonesian news portal

Stars: ✭ 143

Word reference (kemdikbud) link