Dureader-Bert

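Single-model code for the 2019 Dureader machine reading comprehension competition.

Chinese whole-word-masking BERT released by the HIT and iFLYTEK Joint Laboratory

Paper link
Pretrained model download link

To use it, replace the pretrained model being loaded with chinese_wwm_pytorch.bin from the downloaded archive, i.e. change weights_path and config_file in the from_pretrained function.

Google's Chinese BERT vs. HIT's Chinese whole-word-masking BERT on Dureader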

Model	ROUGE-L	BLEU-4
Google BERT	49.3	50.2
HIT BERT	50.32	51.4

Because the official test set was not released, these results were measured on the validation set.
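
For readers unfamiliar with the ROUGE-L column, here is a minimal sketch of how an LCS-based ROUGE-L F-score can be computed. It is not the repository's mrc_eval.py; the function names, the character-level tokenization, and the default beta of 1.2 are assumptions made for illustration. It returns a value in [0, 1], whereas the table reports scores on a 0 to 100 scale.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-score between two token lists (beta is an assumed default)."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


# Chinese text is split into characters here, which is an assumption.
score = rouge_l(list("北京是中国的首都"), list("中国的首都是北京"))
print(round(score, 4))
```

A larger beta weights recall more heavily than precision, so answers that cover the reference are penalized less for extra tokens.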

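Many people have asked, so a few clarifications:

1. The data processing is my own code and does not use the SQuAD processing. You can swap in any other dataset as long as the input format matches, or rewrite the processing yourself.
2. The gains in the competition came mainly from multi-task training and answer extraction. That code is heavy, so this repository contains single-task training only.
3. For the output layer I use just one fully connected layer; you can change it to the output layer from the paper, as follows:

Code:

The code mainly strips out a large amount of unnecessary code and replaces the English data processing with Chinese data processing, which makes the BERT code easier to read and learn.
The handle_data folder processes the Dureader data; it is specific to the competition and has little to do with BERT.
The dataset folder holds the code that processes Chinese data, roughly converting text into BERT inputs (inputs_ids, token_type_ids, input_mask) and then wrapping them in a dataloader.
The predict folder is used for prediction; it is largely the same as training, with some differences in detail (the output).
In short, anything works as long as the input matches the BERT input format: (inputs_ids, token_type_ids, input_mask).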
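
To make the three-part input concrete, here is a minimal sketch of how a question and a passage might be packed into that triple. It is not the code in the dataset folder: build_bert_inputs, TOY_VOCAB, and max_len are hypothetical names introduced for illustration, the sketch spells the first field input_ids, and a real pipeline would use the vocabulary shipped with the pretrained model.

```python
# Toy vocabulary for illustration only; the real vocab comes with the pretrained model.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "天": 4, "气": 5, "晴": 6}


def build_bert_inputs(question_tokens, passage_tokens, vocab, max_len=16):
    """Return (input_ids, token_type_ids, input_mask), each of length max_len."""
    # Reserve three positions for [CLS] and the two [SEP] markers.
    room = max_len - len(question_tokens) - 3
    if room < 0:
        raise ValueError("max_len is too small for the question")
    passage_tokens = passage_tokens[:room]  # truncate the passage to fit

    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + passage_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question, and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(passage_tokens) + 1)
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    input_mask = [1] * len(tokens)  # 1 marks real tokens, 0 marks padding

    pad = max_len - len(tokens)
    input_ids += [vocab["[PAD]"]] * pad
    token_type_ids += [0] * pad
    input_mask += [0] * pad
    return input_ids, token_type_ids, input_mask


ids, types, mask = build_bert_inputs(["天", "气"], ["晴", "好"], TOY_VOCAB, max_len=8)
print(ids, types, mask)
```

Any dataset that can be reduced to these three equal-length lists per example can then be batched into a dataloader.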

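A few tips:

Final competition result: 7th place, ROUGE-L: 53.62, BLEU-4: 54.97.
The code ran successfully before it was uploaded, so if you hit errors, the likely causes are wrong paths or missing packages; work through them step by step, and feel free to open an issue.
If you have ideas for improving the model, you are very welcome to get in touch (email: [email protected]).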

Environment (CPU is not supported)

python3
torch 1.0
Dependencies: pytorch-pretrained-bert, tqdm, pickle, torchtext
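
Since missing packages are a likely source of errors, a quick pre-flight check can help. The sketch below is not part of the repository; missing_modules and REQUIRED_MODULES are hypothetical names, and the import names (for example pytorch_pretrained_bert for the pytorch-pretrained-bert package) are assumptions about how the listed packages are imported. pickle is part of the Python standard library.

```python
import importlib.util

# Import names assumed for the packages listed above.
REQUIRED_MODULES = ["torch", "pytorch_pretrained_bert", "tqdm", "pickle", "torchtext"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", missing if missing else "none")
```

This only checks that the modules can be located. Because CPU is not supported, confirming that torch.cuda.is_available() returns True is a sensible follow-up once torch is installed.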

Reference

BERT paper
Dureader
Chinese whole-word-masking BERT paper
pytorch-pretrained-BERT

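How to run

1. Data processing:

Put trainset, devset, and the other data in the data folder (the trainset and devset under data contain only part of the data; you can replace them with the full data).
In the handle_data directory, run sh run.sh --para_extraction; the processed data is written to the corresponding folders under extracted.

2. Build the dataset:

In the dataset directory, run python3 run_squad.py twice to generate train.data and dev.data. After the first run finishes, change the parameters in run_squad.py; the end of run_squad.py explains exactly what to change.

3. Training:

In the repository root, run python3 train.py; validation runs alongside training.

4. Testing:

In the predict directory, run python3 util.py (the test set is large, so you can also point the path in that file at the validation set; the validation-set path is the default).
Run python3 predicting.py.
In the metric directory, run python3 mrc_eval.py predicts.json ref.json v1.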
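
Prediction has to turn model scores into an answer span before the metric script can compare it with the reference. The sketch below shows one common way to do that, assuming the model emits one start score and one end score per token as in standard BERT question answering. It is not the logic in the predict folder; best_span and max_answer_len are hypothetical names.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logits[i] + end_logits[j].

    Only spans with start <= end and at most max_answer_len tokens are considered.
    Returns None when the inputs are empty.
    """
    best = None
    for i, start_score in enumerate(start_logits):
        last = min(i + max_answer_len, len(end_logits))
        for j in range(i, last):
            score = start_score + end_logits[j]
            if best is None or score > best[2]:
                best = (i, j, score)
    return best


start, end, _ = best_span([0.1, 2.0, 0.3], [1.5, 0.2, 1.0])
tokens = ["今", "天", "晴"]
print("".join(tokens[start:end + 1]))
```

Capping the span length keeps the search at roughly n times max_answer_len steps and stops the model from returning implausibly long answers.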

Leaderboard:
