hailiang-wang / egret-wenda-corpus

License: Apache-2.0

A Public Corpus for Machine Learning

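Important notice

For training machine learning models, evaluating algorithms, and exchanging results, a higher-quality corpus is now available: the open insurance-industry question-answering dataset for machine learning (机器学习保险行业问答开放数据集).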

Egret Wenda Corpus

A Chinese question-answering corpus

A QA corpus based on the Egret BBS.

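In machine learning, training a question-answering bot usually requires high-quality data. For English there are many large corpora, but for Chinese very little is publicly available. While studying, I came across the Ubuntu Dialogue Corpus, which inspired me to mine data from technical communities and build a corpus of my own.

The current version of the corpus is compiled from the 10,000+ questions in the Q&A section of the official Egret (白鹭时代) forum, keeping only the records that were marked with a "best answer".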

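- Use a crawler to store the target data in a database.
- Generate the raw data from the database.
- Manually review the raw data and give each question one acceptable answer.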

The corpus currently contains 2,907 question-answer pairs. Although that is small, it may be enough for a single vertical domain.

DESCRIPTION

In all files the field separator is " +++$+++ "

egret_wenda_lines.txt

- contains the actual text of each utterance
- fields:
	- lineID
	- person id (who uttered this phrase)
	- text of the utterance
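
As an illustration of the record layout above, here is a minimal Python sketch that splits one record of egret_wenda_lines.txt into its three fields. The names `Utterance`, `parse_line`, and `load_lines`, and the UTF-8 default encoding, are assumptions made for this example and are not part of the corpus.

```python
from typing import Dict, NamedTuple

SEPARATOR = " +++$+++ "  # field separator stated in the corpus description


class Utterance(NamedTuple):
    line_id: str
    person_id: str
    text: str


def parse_line(raw: str) -> Utterance:
    """Split one record of egret_wenda_lines.txt into its three fields."""
    # maxsplit=2 keeps the utterance whole even if its text contains the separator.
    line_id, person_id, text = raw.rstrip("\n").split(SEPARATOR, 2)
    return Utterance(line_id, person_id, text)


def load_lines(path: str, encoding: str = "utf-8") -> Dict[str, Utterance]:
    """Read the file into a dict keyed by lineID (the encoding is an assumption)."""
    utterances: Dict[str, Utterance] = {}
    with open(path, encoding=encoding) as handle:
        for raw in handle:
            if raw.strip():  # skip blank lines
                utterance = parse_line(raw)
                utterances[utterance.line_id] = utterance
    return utterances
```

Keying the result by lineID makes it straightforward to look up the question and answer text when reading the conversations file.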

egret_wenda_conversations.txt

- describes the structure of the conversations
- fields:
	- conversationId
	- person id of the first participant in the conversation
	- person id of the second participant in the conversation
	- date of the post
	- source URL of the conversation
	- list of the utterances that make up the conversation, in chronological
		order: ['Question lineID','Answer lineID']
		each lineID has to be matched against egret_wenda_lines.txt to reconstruct the actual content
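
The matching step described above can be sketched in Python as follows. This is only a sketch under two assumptions: that each record has exactly the six fields listed, and that the last field is written as a Python-style list literal, as the ['Question lineID','Answer lineID'] notation suggests. The names `Conversation`, `parse_conversation`, and `to_qa_pair` are hypothetical.

```python
import ast
from typing import Dict, List, NamedTuple, Tuple

SEPARATOR = " +++$+++ "  # field separator stated in the corpus description


class Conversation(NamedTuple):
    conversation_id: str
    first_person_id: str
    second_person_id: str
    date: str
    url: str
    line_ids: List[str]


def parse_conversation(raw: str) -> Conversation:
    """Split one record of egret_wenda_conversations.txt into its six fields."""
    fields = raw.rstrip("\n").split(SEPARATOR)
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    conv_id, first, second, date, url, ids = fields
    # Assumption: the utterance list is a Python-style literal such as "['L1', 'L2']".
    return Conversation(conv_id, first, second, date, url, list(ast.literal_eval(ids)))


def to_qa_pair(conv: Conversation, texts: Dict[str, str]) -> Tuple[str, str]:
    """Look up the question and answer text by lineID (texts maps lineID to text)."""
    question_id, answer_id = conv.line_ids[:2]
    return texts[question_id], texts[answer_id]
```

Here `texts` stands for a lineID-to-text mapping built from egret_wenda_lines.txt. `ast.literal_eval` is used rather than `eval` so that the list field is parsed as data and never executed.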

What's more

The data in raw is the unprocessed data from the BBS.

To make it more suitable for training, I personally reviewed the raw data and modified some utterances, for example by deleting code from them.

processer.js

Generates the raw data from the data collection, which is built from the Egret Q&A section (Egret问答专区).

Tips

NOTE: If you have results to report on this corpus, please send an email to [email protected] so I can add you to the list of people using this data.

Thanks!