hailiang-wang / egret-wenda-corpus

License: Apache-2.0
A Public Corpus for Machine Learning

Programming Languages

javascript


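Important Notice

A higher-quality corpus is now available for training machine learning models, evaluating algorithms, and exchanging ideas: the Machine Learning Insurance Industry QA Open Dataset (机器学习保险行业问答开放数据集).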

Egret Wenda Corpus

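Chinese Question-Answering Corpus

A QA corpus based on the Egret BBS.

Training a question-answering bot with machine learning usually requires high-quality data. For English there are many large corpora, but for Chinese very little is publicly available. While studying the topic I came across the Ubuntu Dialogue Corpus, which inspired me to mine data from a technical community and build a corpus from it.

The current version of the corpus is compiled from the 10,000+ questions in the Q&A section of the official Egret (白鹭时代) forum, keeping only the records that were marked with a "best answer".

  • Use a crawler to store the target data in a database
  • Generate the raw data from the database
  • Manually review the raw data and give each question one acceptable answer

The corpus currently contains 2,907 question-answer pairs. That is small, but it may be enough for a single vertical domain.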

DESCRIPTION

In all files the field separator is " +++$+++ "

egret_wenda_lines.txt

- contains the actual text of each utterance
- fields:
	- lineID
	- person id (who uttered this phrase)
	- text of the utterance
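
As a minimal sketch of reading this file, the snippet below splits each record on the " +++$+++ " separator into the three fields listed above. The function names and the sample records are hypothetical, made up for illustration; only the separator and the field order come from this README.

```python
from typing import Dict, Iterable, Tuple

SEP = " +++$+++ "  # field separator stated in this README


def parse_line_record(record: str) -> Tuple[str, str, str]:
    """Split one record into (line_id, person_id, text)."""
    # maxsplit=2 keeps the utterance intact even if it contains the separator.
    line_id, person_id, text = record.rstrip("\n").split(SEP, 2)
    return line_id, person_id, text


def load_lines(records: Iterable[str]) -> Dict[str, str]:
    """Map lineID -> utterance text, skipping blank records."""
    lines: Dict[str, str] = {}
    for record in records:
        if record.strip():
            line_id, _person_id, text = parse_line_record(record)
            lines[line_id] = text
    return lines


# Hypothetical sample records; the IDs and texts are invented placeholders.
sample = [
    "L1 +++$+++ u1 +++$+++ question text\n",
    "L2 +++$+++ u2 +++$+++ answer text\n",
]
lines = load_lines(sample)
```

With the real data, `load_lines` could be passed an open file handle for egret_wenda_lines.txt instead of the sample list. The file encoding is not stated here, so UTF-8 would be an assumption.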

egret_wenda_conversations.txt

- the structure of the conversations
- fields:
	- conversationId
	- person id of the first person involved in the conversation
	- person id of the second person involved in the conversation
	- date of the post
	- source URL of the conversation
	- list of the utterances that make up the conversation, in chronological
		order: ['Question lineID','Answer lineID'];
		these IDs have to be matched with egret_wenda_lines.txt to reconstruct the actual content
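
The matching step described above can be sketched as follows. This is an illustration under assumptions rather than the project's own tooling: it assumes the utterance list is stored literally in the form ['Question lineID','Answer lineID'] so that `ast.literal_eval` can read it, and the function names, IDs, date, and URL in the sample are hypothetical.

```python
import ast
from typing import Dict, Iterable, List, Tuple

SEP = " +++$+++ "  # field separator stated in this README


def parse_conversation(record: str) -> Tuple[str, str, str, str, str, List[str]]:
    """Split one conversation record into its six documented fields."""
    fields = record.rstrip("\n").split(SEP)
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    conv_id, first_id, second_id, date, url, utterances = fields
    # Assumption: the last field is a Python-style list literal of line IDs.
    return conv_id, first_id, second_id, date, url, list(ast.literal_eval(utterances))


def qa_pairs(conversations: Iterable[str], lines: Dict[str, str]) -> List[Tuple[str, str]]:
    """Join each conversation with the lineID -> text map to get (question, answer)."""
    pairs = []
    for record in conversations:
        if not record.strip():
            continue  # skip blank records
        *_, line_ids = parse_conversation(record)
        question_id, answer_id = line_ids
        pairs.append((lines[question_id], lines[answer_id]))
    return pairs


# Hypothetical sample data; every value below is an invented placeholder.
lines = {"L1": "question text", "L2": "answer text"}
conversations = [
    "C1 +++$+++ u1 +++$+++ u2 +++$+++ 2016-01-01 +++$+++ "
    "http://example.com/thread/1 +++$+++ ['L1', 'L2']\n"
]
pairs = qa_pairs(conversations, lines)
```

Here `lines` stands in for a lineID-to-text mapping built from egret_wenda_lines.txt. A missing line ID raises `KeyError`, which makes inconsistencies between the two files visible early.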

What's more

The data in raw is the unprocessed data from the BBS.

To make it more suitable for training, I personally reviewed the raw data and modified some utterances, for example by deleting code from them.

processer.js

Generates the raw data from the data collection, which is built from Egret问答专区 (the Egret Q&A section).

Tips

NOTE: If you have results to report on this corpus, please send an email to [email protected] so that I can add you to the list of people using this data.

Thanks!
