
XiaoyuanYi / WMPoetry

Licence: other
The source codes of Working Memory model for Chinese poetry generation (IJCAI 2018).

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to WMPoetry

Chinese financial sentiment dictionary
A Chinese financial sentiment word dictionary
Stars: ✭ 67 (+36.73%)
Mutual labels:  chinese
neural network papers
Notes on papers I have read, with personal ratings and brief summaries of each paper's insights
Stars: ✭ 152 (+210.2%)
Mutual labels:  chinese
LightLM
High-performance lightweight model evaluation. Shared Tasks in NLPCC 2020, Task 1 - Light Pre-Training Chinese Language Model for NLP Task
Stars: ✭ 54 (+10.2%)
Mutual labels:  chinese
CLUEmotionAnalysis2020
CLUE Emotion Analysis Dataset: a fine-grained sentiment analysis dataset
Stars: ✭ 3 (-93.88%)
Mutual labels:  chinese
GameWord
A Chinese-English glossary of common gaming terms
Stars: ✭ 157 (+220.41%)
Mutual labels:  chinese
pbrtbook
An integrated Chinese translation of pbrt, Physically Based Rendering: From Theory To Implementation
Stars: ✭ 221 (+351.02%)
Mutual labels:  chinese
ODSQA
ODSQA: OPEN-DOMAIN SPOKEN QUESTION ANSWERING DATASET
Stars: ✭ 43 (-12.24%)
Mutual labels:  chinese
fishing-funds
A status-bar app for tracking funds, market indices, stocks and cryptocurrencies. Built with Electron; supports macOS, Windows and Linux clients; data sourced from Tiantian Fund, Ant Fund, iFund, Tencent Securities, Sina Fund and others
Stars: ✭ 424 (+765.31%)
Mutual labels:  chinese
wasm-cn
[Translation in progress] Chinese documentation for WebAssembly
Stars: ✭ 22 (-55.1%)
Mutual labels:  chinese
FairAI
This is a collection of papers and other resources related to fairness.
Stars: ✭ 55 (+12.24%)
Mutual labels:  ijcai
LEDs-single-gpu-passthrough
Single GPU passthrough guide and tutorial resources
Stars: ✭ 87 (+77.55%)
Mutual labels:  chinese
seqgan
Training a SeqGAN model on the Xiaohuangji chatbot corpus to generate replies
Stars: ✭ 33 (-32.65%)
Mutual labels:  chinese
tudien
A Vietnamese dictionary for Kindle
Stars: ✭ 38 (-22.45%)
Mutual labels:  chinese
kula
Lightweight and highly extensible .NET scripting language.
Stars: ✭ 43 (-12.24%)
Mutual labels:  chinese
alfred-chinese-converter
An Alfred 2 workflow supporting OpenCC word-level conversion between Simplified and Traditional Chinese, variant-character conversion and regional phrasing conversion
Stars: ✭ 42 (-14.29%)
Mutual labels:  chinese
syng
A free, open source, cross-platform, Chinese-To-English dictionary for desktops.
Stars: ✭ 108 (+120.41%)
Mutual labels:  chinese
Functional-Light-JS-Zh
《Functional-Light-JS》中文翻译
Stars: ✭ 14 (-71.43%)
Mutual labels:  chinese
hzk-pixel-font
Chinese pixel fonts in 12 px and 16 px sizes.
Stars: ✭ 14 (-71.43%)
Mutual labels:  chinese
ChineseFonts
Convert Asian text to web fonts
Stars: ✭ 14 (-71.43%)
Mutual labels:  chinese
kaldi-timit-sre-ivector
Develop speaker recognition model based on i-vector using TIMIT database
Stars: ✭ 17 (-65.31%)
Mutual labels:  chinese

WMPoetry

The source code for Chinese Poetry Generation with a Working Memory Model (IJCAI 2018). More related resources are available at THUAIPoet.

0. Notice

  • We updated the environment from Python 2.7 & TensorFlow 1.4 to Python 3.6.5 & TensorFlow 1.10.
  • The source code has been restructured in a cleaner coding style.
  • We also improved several implementation details.

1. Rights

All rights reserved.

2. Requirements

  • python==3.6.5
  • TensorFlow==1.10

A PyTorch version of our model is available here.

3. Data Preparations

To train the model and generate poems, we provide some necessary data files as follows:

  • A rhyme dictionary. We use cilinzhengyun (《词林正韵》) instead of pingshuiyun (《平水韵》).
  • The stop-word files.
  • A tf-idf file, which contains pre-calculated tf-idf values.
  • The Ping (level) tone dictionary and Ze (oblique) tone dictionary.
  • A human-checked high-quality words file.
  • A genre pattern file for quatrains.

We also provide a small corpus with 25,000 Chinese quatrains for testing this code.

All these data files are available here.

You can also use your own data.

4. Preprocessing

4.1. Word Segmentation

First, move all the downloaded files to WMPoetry/preprocess/data/, then segment the corpus with any released segmentation tool. Save the poems with whitespace separating words and with '|' separating sentences. In the resulting segmented corpus file, each line is one poem.

The provided small corpus has been segmented with our own poetry segmentation tool.
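For illustration only, a line in the expected format can be produced as follows (the helper name and the exact delimiter spacing are assumptions, not part of the repo's scripts):

```python
# Sketch of the segmented-corpus line format: one poem per line,
# words separated by whitespace, sentences separated by '|'.

def format_poem(sentences):
    """Join a poem given as a list of sentences, each a list of segmented words."""
    return "|".join(" ".join(words) for words in sentences)

# Hypothetical example: a pre-segmented quatrain couplet.
poem = [["白日", "依", "山", "尽"], ["黄河", "入", "海", "流"]]
print(format_poem(poem))  # 白日 依 山 尽|黄河 入 海 流
```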

4.2. Keywords Extraction and Genre Pattern Building

First, one needs to manually divide the whole corpus file into a training file, a validation file and a testing file. For example, from our small corpus we use 23,000 poems as the training file (train.txt), 1,000 as the validation file (valid.txt) and 1,000 as the testing file (test.txt).
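This manual split can be sketched as follows (a minimal illustration; the repo leaves this step to the user, and the shuffling and function name are assumptions):

```python
import random

def split_corpus(lines, n_valid=1000, n_test=1000, seed=0):
    """Shuffle segmented poems and split them into train/valid/test lists."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)  # fixed seed for a reproducible split
    valid = lines[:n_valid]
    test = lines[n_valid:n_valid + n_test]
    train = lines[n_valid + n_test:]
    return train, valid, test

# For a 25,000-poem corpus this yields 23,000 / 1,000 / 1,000 poems,
# which can then be written to train.txt, valid.txt and test.txt.
```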

We provide a script to extract keywords and build genre patterns, but only for Chinese quatrains. Scripts for other genres, such as lyrics and Song iambics, will be released in the future.

Put the segmented corpus file (e.g., train.txt, valid.txt and test.txt) into WMPoetry/preprocess/, then in WMPoetry/preprocess/, run:

python preprocess.py --inp valid.txt --out valid_keys.txt
python preprocess.py --inp test.txt --out test_keys.txt
python preprocess.py --inp train.txt --out train_keys.txt --cl 1

This produces the processed files train_keys.txt, valid_keys.txt and test_keys.txt.

NOTE: By running preprocess.py on the training file (train.txt) with --cl set to 1, one also gets DuplicateCheckLib.txt, which contains all distinct lines in the training set. When generating poems, we remove generated candidates that already appear in DuplicateCheckLib.txt. DuplicateCheckLib.txt is also used to build the dictionary.
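The duplicate check described above amounts to simple set membership; a minimal sketch (the function names are illustrative, and the actual code may normalize lines differently):

```python
def build_dup_lib(lines):
    """Collect every distinct line of the training set for O(1) lookup."""
    return {line.strip() for line in lines}

def filter_candidates(candidates, dup_lib):
    """Drop generated candidates that already occur verbatim in the training set."""
    return [c for c in candidates if c.strip() not in dup_lib]
```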

4.3. Binarization

If there are no pre-trained word embeddings or corresponding dictionary files, please first build the dictionary in WMPoetry/preprocess/ by running:

python build_dic.py -i DuplicateCheckLib.txt -m 3

This produces the dictionary file and the inverted dictionary file, vocab.pickle and ivocab.pickle. Only characters that occur more than -m times are kept.
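Conceptually, this step counts character frequencies and keeps those above the threshold; a simplified sketch (the real build_dic.py likely also reserves special symbols such as PAD/UNK, which are omitted here):

```python
import pickle
from collections import Counter

def build_vocab(lines, min_count=3):
    """Map each character occurring more than min_count times to an integer id."""
    counts = Counter(ch for line in lines for ch in line if not ch.isspace())
    kept = sorted(ch for ch, c in counts.items() if c > min_count)
    vocab = {ch: i for i, ch in enumerate(kept)}
    ivocab = {i: ch for ch, i in vocab.items()}
    return vocab, ivocab

# The script then pickles both mappings, e.g.:
# pickle.dump(vocab, open("vocab.pickle", "wb"))
# pickle.dump(ivocab, open("ivocab.pickle", "wb"))
```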

Then, binarize training data and validation data:

python binarize.py -i valid_keys.txt -b valid.pickle -d vocab.pickle
python binarize.py -i train_keys.txt -b train.pickle -d vocab.pickle
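Binarization here presumably means mapping each character of a line to its integer id via vocab.pickle; a minimal sketch (the handling of unknown characters and of the '|' separator is an assumption):

```python
def binarize_line(line, vocab, unk_id=None):
    """Convert one segmented line into a list of character ids."""
    # Whitespace word separators are dropped; unknown characters map to unk_id.
    return [vocab.get(ch, unk_id) for ch in line if not ch.isspace()]
```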

4.4. Before Training and Generation

Before training and generation, please:

  1. move test_keys.txt to WMPoetry/wm/;
  2. move train.pickle, valid.pickle, vocab.pickle and ivocab.pickle to WMPoetry/wm/train/;
  3. move pingsheng.txt, zesheng.txt and cilinList.txt to WMPoetry/wm/other/;
  4. move DuplicateCheckLib.txt, GenrePatterns.txt and fchars.txt to WMPoetry/wm/other/.

5. Training

In WMPoetry/wm, first edit config.py to set the configuration: hidden size, embedding size, data path, model path, GPU and so on. By default, all data files are saved in WMPoetry/wm/data, and the model files and pre-trained model files (checkpoints) are saved in WMPoetry/wm/model/ and WMPoetry/wm/premodel/, respectively.

5.1. Pre-Training

We recommend pre-training the encoder, decoder and embeddings by first training a simple sequence-to-sequence model, which stabilizes the training of the working memory model. In WMPoetry/wm, run:

python pretrain.py

Some training information is printed to the console during pre-training.

One can also check the saved training information in trainlog.txt. By default, the pre-trained model files are stored in WMPoetry/wm/premodel/.

5.2. Training

In WMPoetry/wm, run:

python train.py

The model will load the pre-trained parameters of the encoder, decoder and embeddings from the checkpoints in the premodel path. If one doesn't need pre-training, please skip the pre-training step and set:

self.__use_pretrain = False

on line 20 of train.py, then run train.py directly.

During the training process, some training information is printed to the console.

6. Generation

We provide two interfaces for poetry generation.

The first one is an interactive interface. In WMPoetry/wm, run:

python gen_ui.py -t single -b 20

Then one can interactively input keywords and select the genre pattern and rhyme.

One can also set a specific checkpoint as:

python gen_ui.py -t single -b 20 -m model/poem.ckpt_4-5988

The second interface generates poems for the whole testing file:

python gen_ui.py -t file -b 20 -i test_keys.txt -o output.txt

7. System

This work has been integrated into the automatic poetry generation system **THUAIPoet (Jiuge, 九歌)**, which is available at https://jiuge.thunlp.cn. The system is developed by the Research Center for Natural Language Processing, Computational Humanities and Social Sciences, Tsinghua University (清华大学人工智能研究院,自然语言处理与社会人文计算研究中心). Please refer to THUAIPoet, THUNLP and the THUNLP Lab for more information.

8. Cite

If you use our code, please kindly cite this paper:

Xiaoyuan Yi, Maosong Sun, Ruoyu Li and Zonghan Yang. Chinese Poetry Generation with a Working Memory Model. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 4553–4559, Stockholm, Sweden, 2018.

The bib format is as follows:

@inproceedings{Yimemory:18,
    author  = {Xiaoyuan Yi and Maosong Sun and Ruoyu Li and Zonghan Yang},
    title   = {Chinese Poetry Generation with a Working Memory Model},
    year    = "2018",
    pages   = "4553--4559",
    booktitle = {Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence},
    address = {Stockholm, Sweden}
}

9. Contact

If you have any questions, suggestions or bug reports, please feel free to email [email protected] or [email protected].
