DeepLearnXMU / embedding-transfer


Code for "Bridging Subword Gaps in Pretrain-Finetune Paradigm for Natural Language Generation" (ACL 2021)

Description of directories

  • poattention (modified from Fairseq): Training the Position-Aware Embedding Generator for seq2seq models.
  • use_poattention (modified from Fairseq): Generating embeddings for unseen tokens and fine-tuning the seq2seq model with the downstream vocabulary on the downstream task.
  • bert_poattention (modified from Transformers): Training the Position-Aware Embedding Generator for BERT-like models.
  • bert_use_poattention (modified from Fairseq): Generating embeddings for unseen tokens, converting the parameters of the BERT-like model into a seq2seq one, and fine-tuning the seq2seq model with the newly generated vocabulary on the downstream task.
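
Taken together, the two seq2seq directories are used in sequence: poattention trains the embedding generator, and use_poattention applies it and fine-tunes. The sketch below only condenses the commands detailed under "How to run"; Fairseq data preprocessing in each directory is omitted.

    cd poattention
    cp path_to_pretrained_model ./checkpoints/checkpoint_last.pt
    pip install .; bash train.sh     # train the embedding generator

    cd ../use_poattention
    python get_map_index.py          # map the downstream vocabulary to the upstream one
    cp path_to_embedding_generator ./checkpoints/checkpoint_last.pt
    pip install .; bash train.sh     # generate embeddings for unseen tokens and fine-tune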

How to run

For seq2seq pretrained models

poattention

  1. Preprocess upstream and downstream data (refer to Fairseq for details). Binarized data and vocabularies will be stored in data-bin.
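
    For example, a typical Fairseq preprocessing call looks roughly like the following; the language pair, file prefixes, and worker count are placeholders rather than this project's actual settings:

    fairseq-preprocess --source-lang src --target-lang tgt \
        --trainpref data/train --validpref data/valid --testpref data/test \
        --destdir data-bin --workers 8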

  2. Move the seq2seq pretrained model (generated by Fairseq) to ./checkpoints and rename it as checkpoint_last.pt.

    cp path_to_pretrained_model ./checkpoints/checkpoint_last.pt

  3. Train the embedding generator

    pip install .; bash train.sh

  4. Stop training when the model tends to converge.

use_poattention

  1. Preprocess upstream and downstream data (refer to Fairseq for details). Binarized data and vocabularies will be stored in data-bin.

  2. Get the mapping between the upstream and downstream vocabularies.

    python get_map_index.py

    Note: please change the data name in get_map_index.py

  3. Move the well-trained embedding generator checkpoint (generated by poattention) to ./checkpoints and rename it as checkpoint_last.pt.

    cp path_to_embedding_generator ./checkpoints/checkpoint_last.pt

  4. Generate embeddings for the unseen tokens and fine-tune the downstream model with the downstream vocabulary.

    pip install .; bash train.sh

For BERT-like pretrained models

bert_poattention

  1. Prepare the upstream data (plain text) at ./examples/language-modeling/data.
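
    Note: the layout below is only an illustrative assumption (train_mlm.sh defines the exact file paths it reads), with path_to_upstream_corpus as a placeholder:

    mkdir -p ./examples/language-modeling/data
    cp path_to_upstream_corpus/train.txt ./examples/language-modeling/data/
    cp path_to_upstream_corpus/valid.txt ./examples/language-modeling/data/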

  2. Train the embedding generator

    pip install .

    cd ./examples/language-modeling

    bash train_mlm.sh

bert_use_poattention

  1. Preprocess upstream and downstream data (refer to Fairseq for details). Binarized data and vocabularies will be stored in data-bin.

    Note: sentences should be tokenized with WordPiece; we suggest bert-vocab-builder for building the vocabulary of the downstream data.

  2. Get the mapping between the upstream and downstream vocabularies.

    python get_map_index.py

    Note: please change the data name in get_map_index.py

  3. Generate embeddings for the unseen tokens and fine-tune the downstream model with the downstream vocabulary.

    pip install path_to_bert_poattention

    pip install .; bash train.sh
