circlePi / knowledge-driven-dialogue-lic2019

Licence: other
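Source code and models for the 5th-place entry on Leaderboard B of the 2019 Language and Intelligence Challenge (Knowledge-Driven Dialogue task)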

knowledge-driven-dialogue-2019-lic

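2019 Language and Intelligence Challenge, Knowledge-Driven Dialogue task: 5th-place solution on Leaderboard B
Because online deployment imposed a time limit, the version finally submitted for human evaluation dropped some global topic features, which lowered the model's results; it ranked 9th in the final human evaluation. Leaderboard A: 4th place. Leaderboard B: 5th place.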

Overview

To build a proactive dialogue chatbot, we used a generate-then-rerank method. First, the generative models (Multi-Seq2Seq) produce several candidate replies. Next, the re-ranking model performs query-answer matching to choose the most informative reply among the produced candidates. A paper describing our solution in detail is available at https://arxiv.org/pdf/1907.03590.pdf.
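The two-stage flow described above can be sketched as follows. This is a minimal illustration and not the code in this repository: `rerank_reply`, the toy generators, and the toy scorer are hypothetical stand-ins for the trained Seq2Seq models and the ranking model.

```python
from typing import Callable, Iterable, List


def rerank_reply(context: str,
                 generators: Iterable[Callable[[str], List[str]]],
                 score: Callable[[str, str], float]) -> str:
    """Collect candidates from every generator, then return the best-scoring one."""
    candidates: List[str] = []
    seen = set()
    for generate in generators:
        for reply in generate(context):
            if reply not in seen:  # drop duplicates produced by different models
                seen.add(reply)
                candidates.append(reply)
    if not candidates:
        raise ValueError("no candidate replies were generated")
    return max(candidates, key=lambda reply: score(context, reply))


# Toy stand-ins for trained models (hypothetical, for illustration only).
def toy_generator_a(context: str) -> List[str]:
    return ["I like it.", "It stars Zhang Ziyi."]


def toy_generator_b(context: str) -> List[str]:
    return ["It stars Zhang Ziyi.", "Yes."]


def toy_score(context: str, reply: str) -> float:
    # Placeholder scorer: count reply tokens that also appear in the context.
    context_tokens = set(context.lower().split())
    return sum(tok.strip(".?") in context_tokens for tok in reply.lower().split())


best = rerank_reply("who stars in the movie ? knowledge : zhang ziyi",
                    [toy_generator_a, toy_generator_b], toy_score)
print(best)
```

In the real system the scorer would be the trained ranking model and each generator one of the ensemble's Seq2Seq models; only the control flow is meant to carry over.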

Data Augmentation

We used four data augmentation techniques (Entity Generalization, Knowledge Selection, Switch, and Conversation Extraction) to construct multiple datasets for training the Seq2Seq models. With slight modification of their parameters, the scripts Seq2Seq/preclean_*.py can be used to produce the 6 datasets.
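As a rough illustration of one of these techniques, the sketch below assumes that Entity Generalization means replacing known entity mentions with placeholder tokens before training and restoring them after generation. The function names and the `<ENT_n>` placeholder format are hypothetical and are not taken from the preclean scripts.

```python
from typing import Dict, List, Tuple


def generalize_entities(text: str, entities: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace each known entity mention with a placeholder such as <ENT_0>."""
    mapping: Dict[str, str] = {}
    unique = list(dict.fromkeys(entities))  # de-duplicate, keep input order
    # Replace longer names first so a short name never clobbers part of a longer one.
    for entity in sorted(unique, key=len, reverse=True):
        if entity and entity in text:
            placeholder = f"<ENT_{len(mapping)}>"
            text = text.replace(entity, placeholder)
            mapping[placeholder] = entity
    return text, mapping


def restore_entities(text: str, mapping: Dict[str, str]) -> str:
    """Put the original entity names back into a generated reply."""
    for placeholder, entity in mapping.items():
        text = text.replace(placeholder, entity)
    return text


generalized, mapping = generalize_entities(
    "Zhang Ziyi starred in Hero", ["Hero", "Zhang Ziyi"])
print(generalized)
print(restore_entities(generalized, mapping))
```

The idea behind this kind of preprocessing is that the model learns reply patterns that do not depend on a specific entity name.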

Seq2Seq Model

For ensembling, we chose different encoder and decoder architectures, namely LSTM cells and the Transformer.

Training

  • python preprocess.py
  • python train.py

Testing

python translate.py
All training and testing configuration files can be modified in config/*.yml.
In total, we trained 27 Seq2Seq models for the ensemble.

Answer rank

We used a GBDT regressor for ranking. One may ask why we did not use a neural network such as BERT for ranking. We tried that, but it did not work well.
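To make the idea of ranking with a GBDT regressor concrete, here is a from-scratch toy: gradient boosting of depth-1 trees (stumps) under squared loss, followed by sorting candidates by predicted score. It is only a sketch. The two features, the target scores, and all function names are invented for illustration, and a real run would use an established GBDT library rather than this code.

```python
from typing import List, Tuple

Stump = Tuple[int, float, float, float]  # (feature index, threshold, left value, right value)


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def fit_stump(X: List[List[float]], residuals: List[float]) -> Stump:
    """Find the single split that minimises squared error on the residuals."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            lm, rm = mean(left), mean(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if err < best_err:
                best_err, best = err, (j, threshold, lm, rm)
    if best is None:  # no valid split: fall back to a constant prediction
        m = mean(residuals)
        best = (0, float("inf"), m, m)
    return best


def stump_predict(stump: Stump, row: List[float]) -> float:
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def fit_gbdt(X, y, n_rounds: int = 50, learning_rate: float = 0.3):
    base = mean(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [t - p for t, p in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, stumps, learning_rate


def predict(model, row: List[float]) -> float:
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)


# Invented toy data: [knowledge overlap, length ratio] -> quality score.
X = [[0.9, 0.5], [0.1, 0.4], [0.6, 0.9], [0.2, 0.8], [0.8, 0.3], [0.0, 0.1]]
y = [0.9, 0.1, 0.6, 0.2, 0.8, 0.0]
model = fit_gbdt(X, y)

candidates = {"It stars Zhang Ziyi.": [0.85, 0.5], "Yes.": [0.15, 0.5]}
ranked = sorted(candidates, key=lambda r: predict(model, candidates[r]), reverse=True)
print(ranked[0])
```

Treating ranking as regression means each candidate is scored independently, so picking a reply reduces to taking the highest predicted score.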

Creating ranking dataset

python create_gbdt_dataset.py

Feature extraction

python feature_util_multiprocess.py
The feature extraction code is partly based on Kaggle_HomeDepot by ChenglongChen.
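For readers unfamiliar with this style of feature engineering, the sketch below computes a few generic text-matching features (n-gram Jaccard and Dice overlap, length ratio) between a query and a candidate reply. The feature set and names here are assumptions chosen for illustration; they are not the features computed by feature_util_multiprocess.py.

```python
from typing import Dict, List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def jaccard(a, b) -> float:
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def dice(a, b) -> float:
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0


def extract_features(query: str, candidate: str) -> Dict[str, float]:
    """Hypothetical matching features between a query and one candidate reply."""
    q, c = query.lower().split(), candidate.lower().split()
    return {
        "unigram_jaccard": jaccard(q, c),
        "unigram_dice": dice(q, c),
        "bigram_jaccard": jaccard(ngrams(q, 2), ngrams(c, 2)),
        "length_ratio": len(c) / len(q) if q else 0.0,
    }


features = extract_features("who directed hero", "zhang yimou directed hero")
print(features)
```

Each candidate gets one such feature vector, and the vectors are what a ranking model consumes.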

Checkpoints

Uploading the checkpoints may take some extra time because they are large.
