folia: FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for proces…
Stars: ✭ 56 (-18.84%)
pdf-corpus: Python script to quickly create hand-crafted PDF files
Stars: ✭ 17 (-75.36%)
Fakenewscorpus: A dataset of millions of news articles scraped from a curated list of data sources.
Stars: ✭ 255 (+269.57%)
SpiCE-Corpus: An open-access corpus of conversational bilingual speech in Cantonese and English
Stars: ✭ 33 (-52.17%)
thaigov-corpus: A project to collect news from Thai government websites
Stars: ✭ 19 (-72.46%)
Wordless: An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Stars: ✭ 378 (+447.83%)
KWDLC: Kyoto University Web Document Leads Corpus
Stars: ✭ 64 (-7.25%)
Quanteda: An R package for the Quantitative Analysis of Textual Data
Stars: ✭ 647 (+837.68%)
TV4Dialog: No description or website provided.
Stars: ✭ 33 (-52.17%)
wordfish-python: Extract relationships from standardized terms from a corpus of interest with deep learning 🐟
Stars: ✭ 19 (-72.46%)
Filipino-Text-Benchmarks: Open-source benchmark datasets and pretrained transformer models in the Filipino language.
Stars: ✭ 22 (-68.12%)
textbox: Text collections made available by the CLiGS group.
Stars: ✭ 19 (-72.46%)
Seq2seq Chatbot: Chatbot in 200 lines of code using TensorLayer
Stars: ✭ 777 (+1026.09%)
Cluecorpus2020: Large-scale Pre-training Corpus for Chinese (100G Chinese pre-training corpus)
Stars: ✭ 278 (+302.9%)
Lyrics Corpora: An unofficial Python API that allows users to create a corpus of lyrical text from their favorite artists and Billboard charts
Stars: ✭ 13 (-81.16%)
Indian ParallelCorpus: Curated list of publicly available parallel corpora for Indian languages
Stars: ✭ 23 (-66.67%)
BSD: The Business Scene Dialogue corpus
Stars: ✭ 51 (-26.09%)
open-discourse: Open Discourse is the first fully comprehensive corpus of the plenary proceedings of the federal German Parliament (Bundestag).
Stars: ✭ 47 (-31.88%)
dialogue-datasets: Collects open dialog corpora and some useful data processing utilities.
Stars: ✭ 24 (-65.22%)
malay-dataset: Text corpus for Bahasa Malaysia, https://malaya.readthedocs.io/en/latest/Dataset.html
Stars: ✭ 189 (+173.91%)
Bookcorpus: Crawl BookCorpus
Stars: ✭ 443 (+542.03%)
OpenDialog: An Open-Source Package for Chinese Open-domain Conversational Chatbot (a Chinese chit-chat dialogue system with one-click deployment of a WeChat chit-chat bot)
Stars: ✭ 94 (+36.23%)
Corpora: A collection of small corpuses of interesting data for the creation of bots and similar stuff.
Stars: ✭ 4,293 (+6121.74%)
Typing Assistant: Typing Assistant autocompletes words and suggests predictions for the next word. This makes typing faster and more intelligent, and reduces effort.
Stars: ✭ 32 (-53.62%)
thai-language: Computer tools for the Thai language
Stars: ✭ 20 (-71.01%)
Fuzzdata: Fuzzing resources for feeding various fuzzers with input. 🔧
Stars: ✭ 376 (+444.93%)
cljs-corpus: A greppable archive of ClojureScript code
Stars: ✭ 37 (-46.38%)
Nlp chinese corpus: Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+9546.38%)
bible-corpus: A multilingual parallel corpus created from translations of the Bible.
Stars: ✭ 115 (+66.67%)
Korpora: Korean corpus repository
Stars: ✭ 270 (+291.3%)
PoetryCorpus: Poetic corpus of the Russian language
Stars: ✭ 40 (-42.03%)
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark (a benchmark for Chinese medical information processing)
Stars: ✭ 379 (+449.28%)
jrte-corpus: Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)
Stars: ✭ 66 (-4.35%)
LanguageCodes: We present a list of languages with their codes, families, regions, etc. We also present a list of multilingual corpora (with URLs).
Stars: ✭ 70 (+1.45%)
kanji-frequency: Kanji usage frequency data collected from various sources
Stars: ✭ 92 (+33.33%)
fastmorph: Fast corpus search engine originally made for the Corpus of Written Tatar language
Stars: ✭ 14 (-79.71%)
When-in-Rome: A meta-corpus of functional harmonic analysis.
Stars: ✭ 35 (-49.28%)
Coarij: Corpus of Annual Reports in Japan
Stars: ✭ 55 (-20.29%)
Naive Bayes Classifier: The Naive Bayes classifier is a classification algorithm. It uses the Bernoulli and Multinomial Naive Bayes equations to classify documents (text) as ham or spam.
Stars: ✭ 6 (-91.3%)
Awesome Persian Nlp Ir: Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Stars: ✭ 460 (+566.67%)
DeepSentiPers: Repository for the experiments described in the paper named "DeepSentiPers: Novel Deep Learning Models Trained Over Proposed Augmented Persian Sentiment Corpus"
Stars: ✭ 17 (-75.36%)