Top 107 corpus open source projects

Repository for the experiments described in the paper "DeepSentiPers: Novel Deep Learning Models Trained Over Proposed Augmented Persian Sentiment Corpus"

✭ 17

Jupyter Notebook deep-neural-networks sentiment-analysis neural-network keras corpus cnn dataset lstm classification opinion-mining score data-augmentation polarity wordembeddings architectures fasttext-embeddings persian-sentiment persian-sentiment-analysis

dialogue-datasets

Collects open dialogue corpora and some useful data-processing utilities.

✭ 24

python dialogue dialog corpus conversation dialogue-systems twitter-crawler multi-turn single-turn

Filipino-Text-Benchmarks

Open-source benchmark datasets and pretrained transformer models in the Filipino language.

✭ 22

python benchmark deep-learning text-classification corpus transformer transfer-learning tagalog bert filipino electra nli low-resource-languages tagalog-transformers electra-models

fuzzing-corpus

My fuzzing corpus

✭ 120

javascript ruby HTML assembly c Rich Text Format corpus file-format fuzzing vulnerability testsuite

SpiCE-Corpus

An open-access corpus of conversational bilingual speech in Cantonese and English

✭ 33

javascript HTML CSS corpus english-language cantonese-language bilingual-corpora speech-corpus spice-corpus

OpenDialog

An Open-Source Package for Chinese Open-domain Conversational Chatbot (a Chinese chit-chat dialogue system with one-click deployment of a WeChat chit-chat bot)

✭ 94

python shell reinforcement-learning retrieval corpus transformers pytorch wechat chinese generative bert wechat-api multi-view open-domain gpt2 gan-based

OneStopEnglishCorpus

No description or website provided.

✭ 38

paper corpus

PubMed-PICO-Detection

PubMed PICO Element Detection Dataset

✭ 37

nlp machine-learning deep-learning corpus sentence-classification

folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for proces…

✭ 56

python shell nlp library xml corpus linguistics file-format computational-linguistics folia linguistic-annotation-framework

thai-language

Computer tools for the Thai language

✭ 20

python corpus opennlp thai-language linguistic-corpora

named-entity-recognition-template

Build a deep learning model for predicting the named entities from text.

✭ 51

Jupyter Notebook nlp machine-learning deep-learning tensorflow keras corpus named-entity-recognition floydhub lstm-crf-model

cljs-corpus

A greppable archive of ClojureScript code

✭ 37

corpus

KWDLC

Kyoto University Web Document Leads Corpus

✭ 64

japanese corpus named-entities part-of-speech morphological-analysis dependency-parsing

bible-corpus

A multilingual parallel corpus created from translations of the Bible.

✭ 115

multilingual translation corpus bible bible-corpus

CLUEmotionAnalysis2020

CLUE Emotion Analysis Dataset (a fine-grained sentiment analysis dataset)

✭ 3

python Jupyter Notebook shell sentiment-analysis corpus dataset chinese fine-grained emotion-recognition

PoetryCorpus

Poetic corpus of the Russian language

✭ 40

python HTML javascript CSS shell nlp docker django corpus

pdf-corpus

Python script to quickly create hand-crafted PDF files

✭ 17

python Makefile pdf test-suite corpus

CBLUE

CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark (a benchmark for Chinese medical information processing)

✭ 379

python benchmark evaluation corpus dataset chinese chineseblue biomedical-tasks acl2022

egret-wenda-corpus

A Public Corpus for Machine Learning

✭ 41

javascript qa corpus corpus-data

jrte-corpus

Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)

✭ 66

python Makefile natural-language-processing corpus japanese-language sentiment-polarity textual-entailment

TV4Dialog

No description or website provided.

✭ 33

SRecode Template dialogue corpus subtitles chinese

LanguageCodes

We present a list of languages with their codes, families, regions, etc. We also present a list of multilingual corpora (with URLs).

✭ 70

multi-lingual corpus language-codes

text-classification-cn

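Chinese text classification in practice, based on the Sogou news corpus, using traditional machine learning methods as well as pretrained models and other approaches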

✭ 81

python nlp machine-learning deep-learning text-classification svm word2vec naive-bayes scikit-learn keras corpus cnn logistic-regression tf-idf sogou embedding pretrained text-cnn keras-cnn embedding-layers

kanji-frequency

Kanji usage frequency data collected from various sources

✭ 92

javascript HTML Less coffeescript shell data japanese corpus data-visualization cjk kanji japanese-language corpus-linguistics frequency-lists cjk-characters kanji-frequency

thaigov-corpus

A project that collects news from the Thai government website

✭ 19

python corpus thailand thai-language thai thai-nlp

mev-corpus

MEV Data Corpus

✭ 77

javascript python shell typescript data ethereum blockchain corpus mev flashbots miner-extracted-value

BSD

The Business Scene Dialogue corpus

✭ 51

japanese machine-translation corpus english parallel-corpus parallel-corpora annotated-corpora document-aligned

When-in-Rome

A meta-corpus of functional harmonic analysis.

✭ 35

python music harmony corpus dataset

textbox

Text collections made available by the CLiGS group.

✭ 19

xml corpus spanish french digital-humanities portuguese text-collection literary-studies

malay-dataset

Text corpus for Bahasa Malaysia, https://malaya.readthedocs.io/en/latest/Dataset.html

✭ 189

Jupyter Notebook HTML Rich Text Format python text-mining corpus malaysia bahasa malay-dataset

open2ch-dialogue-corpus

A dialogue corpus created by crawling Open2ch (おーぷん2ちゃんねる)

✭ 65

python japanese dialogue corpus datasets

gum

Repository for the Georgetown University Multilayer Corpus (GUM)

✭ 71

python XSLT cython javascript HTML CSS annotations corpus treebank pos-tagging rhetorical-structure-theory coreference annis

nytwit

New York Times Word Innovation Types dataset

✭ 21

nlp news corpus dataset computational-linguistics

ocr2text

Convert a PDF via OCR to a TXT file in UTF-8 encoding

✭ 90

python pdf converter ocr corpus tesseract batch

OpenConvert

Text conversion tool (from e.g. Word, HTML, txt) to the corpus formats TEI or FoLiA

✭ 20

java XSLT conversion corpus

opensource-voice-tools

A repo listing known open source voice tools, ordered by where they sit in the voice stack

✭ 21

TeX chatbot voice corpus speech conversational-ui tts speech-recognition stt asr

trafilatura

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

Chatbot-Training-Corpus

A summary of text corpora that can be used for hands-on chatbot training, covering both Chinese and English

✭ 117

python chatbot dialogue corpus

tvsub

TVsub: DCU-Tencent Chinese-English Dialogue Corpus

✭ 40

machine-translation corpus chinese-english tv-subtitle

Speech-Corpus-Collection

A Collection of Speech Corpus for ASR and TTS

✭ 113

corpus tts dataset asr

proiel-treebank

Official releases of the PROIEL treebank of ancient Indo-European languages

✭ 30

corpus linguistics latin treebank ancient-greek armenian new-testament gothic2 ancient-languages old-church-slavonic

DANeS

DANeS is an open-source e-newspaper dataset created in collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)

✭ 64

python open-source machine-learning natural-language-processing corpus artificial-intelligence dataset newspaper corpus-data text-sentiment danes datasetvn aivgroup

Probabilistic-RNN-DA-Classifier

Probabilistic Dialogue Act Classification for the Switchboard Corpus using an LSTM model

✭ 22

python dialogue keras corpus recurrent-neural-networks embeddings lstm classification rnn lstm-model probabilistic dialogue-data utterances

rclc

Rich Context leaderboard competition, including the corpus and current SOTA for required tasks.

✭ 20

python shell nlp competition leaderboard corpus knowledge-graph metadata-extraction entity-linking rich-context dataset-ids

german-nouns

A list of ~100,000 German nouns and their grammatical properties, compiled from WiktionaryDE as a CSV file, plus a module to look up the data and parse compound words.

✭ 101

python parser corpus nouns wiktionary german-language german-nouns

megs

A merged version of multiple open-source German speech datasets.

✭ 21

Jupyter Notebook python shell corpus dataset speech-recognition speech-to-text asr

Dialogue-Corpus

No description or website provided.

✭ 27

python dialogue corpus conversation opensubtitles ubuntu-dialog-corpus
