kirralabs / Indonesian Nlp Resources
Licence: MIT
Data resources for Indonesian (Bahasa Indonesia) NLP
Stars: ✭ 143
Projects that are alternatives to or similar to Indonesian Nlp Resources
Nlp bahasa resources
A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia
Stars: ✭ 158 (+10.49%)
Mutual labels: dataset, corpus, sentiment-analysis
Awesome Hungarian Nlp
A curated list of NLP resources for Hungarian
Stars: ✭ 121 (-15.38%)
Mutual labels: dataset, corpus, named-entity-recognition
Universal Data Tool
Collaborate & label any type of data, images, text, or documents, in an easy web interface or desktop app.
Stars: ✭ 1,356 (+848.25%)
Mutual labels: dataset, named-entity-recognition
Pynlp
A pythonic wrapper for Stanford CoreNLP.
Stars: ✭ 103 (-27.97%)
Mutual labels: sentiment-analysis, named-entity-recognition
Dan Jurafsky Chris Manning Nlp
My solutions to the Natural Language Processing course taught by Dan Jurafsky and Chris Manning in Winter 2012.
Stars: ✭ 124 (-13.29%)
Mutual labels: sentiment-analysis, named-entity-recognition
Turkish Bert Nlp Pipeline
Bert-base NLP pipeline for Turkish, Ner, Sentiment Analysis, Question Answering etc.
Stars: ✭ 85 (-40.56%)
Mutual labels: sentiment-analysis, named-entity-recognition
Pytreebank
😡😇 Stanford Sentiment Treebank loader in Python
Stars: ✭ 93 (-34.97%)
Mutual labels: dataset, sentiment-analysis
Ua Gec
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Stars: ✭ 108 (-24.48%)
Mutual labels: dataset, corpus
Clue
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+1595.8%)
Mutual labels: dataset, corpus
Dialog corpus
Corpora for training Chinese and English dialogue (chatbot) systems
Stars: ✭ 1,662 (+1062.24%)
Mutual labels: dataset, corpus
Cluedatasetsearch
Search all Chinese NLP datasets, with commonly used English NLP datasets included
Stars: ✭ 2,112 (+1376.92%)
Mutual labels: corpus, sentiment-analysis
Dataset List
lists of text corpus and more (mainly Japanese)
Stars: ✭ 84 (-41.26%)
Mutual labels: dataset, corpus
Wikipedia ner
📖 Labeled examples from wiki dumps in Python
Stars: ✭ 61 (-57.34%)
Mutual labels: dataset, named-entity-recognition
Bond
BOND: BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision
Stars: ✭ 96 (-32.87%)
Mutual labels: dataset, named-entity-recognition
Phonlp
PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing (NAACL 2021)
Stars: ✭ 56 (-60.84%)
Mutual labels: named-entity-recognition, pos-tagging
Prosody
Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text
Stars: ✭ 139 (-2.8%)
Mutual labels: dataset, corpus
French Sentiment Analysis Dataset
A collection of over 1.5 Million tweets data translated to French, with their sentiment.
Stars: ✭ 35 (-75.52%)
Mutual labels: dataset, sentiment-analysis
Images Web Crawler
This package is a complete tool for creating a large dataset of images (designed especially, but not only, for machine learning enthusiasts). It can crawl the web, download images, rename, resize, and convert the images, and merge folders.
Stars: ✭ 51 (-64.34%)
Mutual labels: dataset, crawler
Camel tools
A suite of Arabic natural language processing tools developed by the CAMeL Lab at New York University Abu Dhabi.
Stars: ✭ 124 (-13.29%)
Mutual labels: sentiment-analysis, named-entity-recognition
indonesian-NLP-resources
NLP data for Bahasa Indonesia (last updated 20 Sep 2020)
Sentences Dataset
- leipzig Indonesian sentence collection: news articles, web articles, and Wikipedia data from 2008-2016
- wn-msa.sourceforge.net Wordnet Bahasa
- Quran indonesian quran translation (id.muntakhab, id.jalalayn, id.indonesian)
- Kompas online collection. This corpus contains Kompas online news articles from 2001-2002. See here for more info and citations.
- Tempo online collection. This corpus contains Tempo online news articles from 2000-2002. See here for more info and citations.
- corpus-frog-storytelling spoken text story telling
- TED-Multilingual-Parallel-Corpus Monolingual_data/Indonesian
- Opus Opus NLPL
- Sealang Sealang dataset
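Word reference (kemdikbud)
- Base entries (Entri Dasar) : 50,668 (45.02%)
- Derived words (Kata Turunan) : 26,835 (23.85%)
- Compound words (Gabungan Kata) : 31,492 (27.98%)
- Proverbs (Peribahasa) : 2,054 (1.83%)
- Figurative expressions (Kiasan) : 269 (0.24%)
- Idiomatic expressions (Ungkapan) : 1,131 (1.00%)
- Variants (Varian) : 89 (0.08%)
- Total entries (Entri Total) : 112,538 (100.00%)
- Total senses (Makna Total) : 131,533
- Total examples (Contoh Total) : 30,010
- Total categories (Kategori Total) : 234
- Senses per entry (Makna Per Entri) : 1.169
- Examples per sense (Contoh Per Makna) : 0.228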
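The shares and ratios in the word reference list follow directly from the raw counts. As a quick sanity check, the short sketch below recomputes them; the counts are copied from the list above, while the variable and function names are illustrative choices and not part of any listed resource.

```python
# Entry counts copied from the kemdikbud word reference list above.
entry_counts = {
    "Entri Dasar": 50668,
    "Kata Turunan": 26835,
    "Gabungan Kata": 31492,
    "Peribahasa": 2054,
    "Kiasan": 269,
    "Ungkapan": 1131,
    "Varian": 89,
}
total_senses = 131533    # Makna Total
total_examples = 30010   # Contoh Total

# The entry categories should add up to Entri Total.
total_entries = sum(entry_counts.values())


def share(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


shares = {name: share(n, total_entries) for name, n in entry_counts.items()}
senses_per_entry = round(total_senses / total_entries, 3)
examples_per_sense = round(total_examples / total_senses, 3)

for name, pct in shares.items():
    print(f"{name}: {entry_counts[name]} ({pct} %)")
print("Makna Per Entri:", senses_per_entry)
print("Contoh Per Makna:", examples_per_sense)
```

The recomputed values agree with the published figures, which suggests the list was transcribed consistently.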
PUEBI (word type)
Words dataset
- word class => noun (18647), verb (39070) = 57717 words
- word type => root word (41409), derivative word (24913), compound word, figure of speech, proverb, expression = 66322 words
- Word root => source#1.1 : sastrawi 29932 words ; source#1.2 : sastrawi 30342 words ; source#2 : SentiStrengthID 27979 words ; source#3 : serangkai 30342 words
- Word spaCy : id
- word : serangkai
- Word name : random-name
- Word Indo name : genderprediction
- Word Wiktionary : word id
- word compound =>
- Word Acronyms =>
- Word Negative =>
- source#1.1 : 3829 words ; source#1.2 : 3523 words ; source#1.3 : 154 words ;
- source#2 : ID-OpinionWords 2402 words
- source#3 : 3523 words
- source#4 : 126 words
- Word Positive =>
- source#1.1 : 1678 words ; source#1.2 : 40 words ; source#1.3 : 1293 words ;
- source#2 : 1182 words
- source#3 : 1293 words
- Word Slang =>
- Stopwords =>
- Emoticon =>
- Named Entity =>
- source#1 : [Place] country
- source#1 : [Place] Wilayah-Administratif-Indonesia (provinces, villages, districts, regencies)
- source#2 : [Place] Indonesia-Postal-Code (provinces, cities, subdistricts, urbans)
- source#3 : [Place] indonesian-region
- source#3 : [Person] gender prediction
- source#4 : [Person] random name
- source#5 : [Person] title of name
- source#6 : [Person] degree
- source#7 : [Org] institution
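Positive and negative word lists such as the ones above are typically used for lexicon-based sentiment scoring. The following is only a minimal sketch of that idea, not the method of any listed project: the two tiny word sets are hand-picked placeholders standing in for the real lists, and `lexicon_score` is a hypothetical helper name.

```python
import re

# Placeholder lexicons (assumption): in practice these sets would be loaded
# from one of the positive / negative word sources listed above.
POSITIVE = {"bagus", "senang", "baik"}
NEGATIVE = {"buruk", "jelek", "kecewa"}


def lexicon_score(text, positive=POSITIVE, negative=NEGATIVE):
    """Return (#positive tokens - #negative tokens) for a piece of text."""
    # Lowercase and keep alphabetic runs only; a deliberately crude tokenizer.
    tokens = re.findall(r"[a-z]+", text.lower())
    pos_hits = sum(token in positive for token in tokens)
    neg_hits = sum(token in negative for token in tokens)
    return pos_hits - neg_hits


print(lexicon_score("Filmnya bagus, saya senang"))
print(lexicon_score("Pelayanannya buruk dan saya kecewa"))
```

A score above zero suggests positive sentiment and below zero negative. The sketch ignores negation, slang, and affixed forms, which is where the stopword, slang, and root-word lists above would come in.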
Tagged dataset
- NER =>
- POS-TAG
- POS-TAG : famrashel/idn-tagged-corpus
- POS-TAG : pebbie/pebahasa ~600 sentences
- POS-TAG Parser : UniversalDependencies/UD_Indonesian-GSD ~4477 sentences
- Sentiment =>
- panl10n Pan Localization
- Acronyms : ramaprakoso/analisis-sentimen 4085 words
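Universal Dependencies treebanks such as UD_Indonesian-GSD are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, `#` comment lines, and a blank line between sentences. The sketch below shows one way to pull (word form, POS tag) pairs out of such text. The sample sentence is a made-up illustration, not a line from the treebank, and `read_conllu` is a hypothetical helper name.

```python
def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences of (form, upos) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment / metadata
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if not cols[0].isdigit():
            continue
        current.append((cols[1], cols[3]))  # FORM and UPOS columns
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample (assumption), built with explicit tab separators.
SAMPLE = "\n".join([
    "# text = Saya makan nasi.",
    "\t".join(["1", "Saya", "saya", "PRON", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "makan", "makan", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", "nasi", "nasi", "NOUN", "_", "_", "2", "obj", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

for sentence in read_conllu(SAMPLE):
    print(sentence)
```

For real use, the text would be read from a downloaded `.conllu` file instead of the inline sample.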
Parallel corpus Eng-Ind
Sentence Analyzer
- MALINDO_Morph
- morphind
- INDRA
- pujangga : An interface for InaNLP and Deeplearning4j's Word2Vec for Indonesian (Bahasa Indonesia) in the form of REST API.
- id-multi-label-hate-speech-and-abusive-language-detection : A dataset for multi-label hate speech and abusive language detection on Indonesian Twitter.
- kawat : A Word Analogy Task Dataset for Indonesian
Crawler Data
- Crawler Indonesian news portal