DANeS
DANeS is an open-source e-newspaper dataset created in collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)
Stars: ✭ 64 (+72.97%)
Weibo terminater
Final Weibo crawler: scrape anything from Weibo (comments, Weibo contents, followers, anything). The Terminator
Stars: ✭ 2,295 (+6102.7%)
BSD
The Business Scene Dialogue corpus
Stars: ✭ 51 (+37.84%)
tvsub
TVsub: DCU-Tencent Chinese-English Dialogue Corpus
Stars: ✭ 40 (+8.11%)
megs
A merged version of multiple open-source German speech datasets.
Stars: ✭ 21 (-43.24%)
PoetryCorpus
Poetic corpus of the Russian language
Stars: ✭ 40 (+8.11%)
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Stars: ✭ 711 (+1821.62%)
TV4Dialog
No description or website provided.
Stars: ✭ 33 (-10.81%)
rclc
Rich Context leaderboard competition, including the corpus and current SOTA for required tasks.
Stars: ✭ 20 (-45.95%)
thaigov-corpus
A project to collect news from the Thai government website
Stars: ✭ 19 (-48.65%)
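Chinese Names Corpus
Chinese personal-name corpus and name generator. Covers Chinese full names, surnames, given names, and forms of address, plus Japanese names, transliterated names, and English names. Useful for Chinese word segmentation and person-name entity recognition.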
Stars: ✭ 3,053 (+8151.35%)
cljs-corpus
A greppable archive of ClojureScript code
Stars: ✭ 37 (+0%)
Efaqa Corpus Zh
❤️ Emotional First Aid Dataset: psychological counseling Q&A and chatbot corpus
Stars: ✭ 170 (+359.46%)
textbox
Text collections made available by the CLiGS group.
Stars: ✭ 19 (-48.65%)
Clue
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+6454.05%)
CBLUE
Chinese medical information processing benchmark. CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
Stars: ✭ 379 (+924.32%)
Awesome Chatbot
Awesome Chatbot Projects, Corpus, Papers, Tutorials. Chinese Chatbot =>:
Stars: ✭ 1,785 (+4724.32%)
nytwit
New York Times Word Innovation Types dataset
Stars: ✭ 21 (-43.24%)
opensource-voice-tools
A repo listing known open source voice tools, ordered by where they sit in the voice stack
Stars: ✭ 21 (-43.24%)
Dialog corpus
Corpora for training Chinese and English dialogue systems (Datasets for Training Chatbot System)
Stars: ✭ 1,662 (+4391.89%)
Customer-Feedback-Analysis
Multi-class text (feedback) classification using a CNN, a GRU network, and pre-trained Word2Vec word embeddings on TensorFlow.
Stars: ✭ 18 (-51.35%)
bible-corpus
A multilingual parallel corpus created from translations of the Bible.
Stars: ✭ 115 (+210.81%)
LanguageCodes
We present a list of languages with their codes, families, regions, etc. We also present a list of multilingual corpora (with URLs).
Stars: ✭ 70 (+89.19%)
proiel-treebank
Official releases of the PROIEL treebank of ancient Indo-European languages
Stars: ✭ 30 (-18.92%)
kanji-frequency
Kanji usage frequency data collected from various sources
Stars: ✭ 92 (+148.65%)
german-nouns
A list of ~100,000 German nouns and their grammatical properties, compiled from WiktionaryDE as a CSV file, plus a module to look up the data and parse compound words.
Stars: ✭ 101 (+172.97%)
CNN-Sentence-Classification
A TensorFlow implementation of Convolutional Neural Networks for Sentence Classification
Stars: ✭ 77 (+108.11%)
thai-language
Computer tools for the Thai language
Stars: ✭ 20 (-45.95%)
Nlvr
Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.
Stars: ✭ 192 (+418.92%)
When-in-Rome
A meta-corpus of functional harmonic analysis.
Stars: ✭ 35 (-5.41%)
Nlp bahasa resources
A curated list of datasets and usable library resources for NLP in Bahasa Indonesia
Stars: ✭ 158 (+327.03%)
pdf-corpus
Python script to quickly create hand-crafted PDF files
Stars: ✭ 17 (-54.05%)
Wp2txt
WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML and compressed with bzip2), stripping all MediaWiki markup and other metadata.
Stars: ✭ 145 (+291.89%)
malay-dataset
Text corpus for Bahasa Malaysia, https://malaya.readthedocs.io/en/latest/Dataset.html
Stars: ✭ 189 (+410.81%)
Prosody
Helsinki Prosody Corpus and a System for Predicting Prosodic Prominence from Text
Stars: ✭ 139 (+275.68%)
KAREN
KAREN: Unifying Hatespeech Detection and Benchmarking
Stars: ✭ 18 (-51.35%)
Code Docstring Corpus
Preprocessed Python functions and docstrings for automated code documentation (code2doc) and automated code generation (doc2code) tasks.
Stars: ✭ 137 (+270.27%)
gum
Repository for the Georgetown University Multilayer Corpus (GUM)
Stars: ✭ 71 (+91.89%)
Khcoder
KH Coder: for Quantitative Content Analysis or Text Mining
Stars: ✭ 126 (+240.54%)
ocr2text
Convert a PDF via OCR to a TXT file in UTF-8 encoding
Stars: ✭ 90 (+143.24%)
folia
FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for proces…
Stars: ✭ 56 (+51.35%)
NSP-BERT
The code for our paper "NSP-BERT: A Prompt-based Zero-Shot Learner Through an Original Pre-training Task —— Next Sentence Prediction"
Stars: ✭ 166 (+348.65%)
KWDLC
Kyoto University Web Document Leads Corpus
Stars: ✭ 64 (+72.97%)
jrte-corpus
Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)
Stars: ✭ 66 (+78.38%)
OpenConvert
Text conversion tool (from e.g. Word, HTML, txt) to corpus formats TEI or FoLiA
Stars: ✭ 20 (-45.95%)