
malllabiisc / SGCP

License: Apache-2.0
TACL 2020: Syntax-Guided Controlled Generation of Paraphrases

Programming Languages

Python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to SGCP

Dips
NAACL 2019: Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation
Stars: ✭ 59 (-11.94%)
Mutual labels:  paper, natural-language-generation
chatbot-samples
🤖 Chatbot with dialogue templates
Stars: ✭ 110 (+64.18%)
Mutual labels:  natural-language-generation
triumph-gui
Simple lib to create inventory GUIs for Bukkit platforms.
Stars: ✭ 196 (+192.54%)
Mutual labels:  paper
jpeg-defense
SHIELD: Fast, Practical Defense and Vaccination for Deep Learning using JPEG Compression
Stars: ✭ 82 (+22.39%)
Mutual labels:  paper
Court-View-Gen
Interpretable Charge Predictions for Criminal Cases: Learning to Generate Court Views from Fact Descriptions
Stars: ✭ 13 (-80.6%)
Mutual labels:  natural-language-generation
DynamicEntitySummarization-DynES
Dynamic Entity Summarization (DynES)
Stars: ✭ 21 (-68.66%)
Mutual labels:  paper
Luci
Logical Unity for Communicational Interactivity
Stars: ✭ 25 (-62.69%)
Mutual labels:  natural-language-generation
awesome-nlg
A curated list of resources dedicated to Natural Language Generation (NLG)
Stars: ✭ 386 (+476.12%)
Mutual labels:  natural-language-generation
Mirai
Mirai 未来 - A powerful Minecraft Server Software coming from the future
Stars: ✭ 325 (+385.07%)
Mutual labels:  paper
SciDownl
An unofficial API for downloading papers from Sci-Hub via DOI or PMID
Stars: ✭ 103 (+53.73%)
Mutual labels:  paper
paper
ReScript bindings for react-native-paper
Stars: ✭ 14 (-79.1%)
Mutual labels:  paper
deep-atrous-guided-filter
Deep Atrous Guided Filter for Image Restoration in Under Display Cameras (UDC Challenge, ECCV 2020).
Stars: ✭ 32 (-52.24%)
Mutual labels:  paper
resources
No description or website provided.
Stars: ✭ 14 (-79.1%)
Mutual labels:  paper
Awesome-Human-Activity-Recognition
An up-to-date & curated list of awesome IMU-based Human Activity Recognition (ubiquitous computing) papers, methods & resources. Note that most of the collected research is based on IMU data.
Stars: ✭ 72 (+7.46%)
Mutual labels:  paper
External-Attention-pytorch
🍀 PyTorch implementations of various attention mechanisms, MLPs, re-parameterization, and convolutions, helpful for understanding the corresponding papers. ⭐⭐⭐
Stars: ✭ 7,344 (+10861.19%)
Mutual labels:  paper
DCGCN
Densely Connected Graph Convolutional Networks for Graph-to-Sequence Learning (authors' MXNet implementation for the TACL19 paper)
Stars: ✭ 73 (+8.96%)
Mutual labels:  natural-language-generation
AdvPC
AdvPC: Transferable Adversarial Perturbations on 3D Point Clouds (ECCV 2020)
Stars: ✭ 35 (-47.76%)
Mutual labels:  paper
question generator
An NLP system for generating reading comprehension questions
Stars: ✭ 188 (+180.6%)
Mutual labels:  natural-language-generation
CURL
Code for the ICPR 2020 paper: "CURL: Neural Curve Layers for Image Enhancement"
Stars: ✭ 177 (+164.18%)
Mutual labels:  paper
rtg
Reader Translator Generator - NMT toolkit based on PyTorch
Stars: ✭ 26 (-61.19%)
Mutual labels:  natural-language-generation

Syntax-Guided Controlled Generation of Paraphrases

Source code for the TACL 2020 paper: Syntax-Guided Controlled Generation of Paraphrases

(Figure: SGCP architecture)

  • Overview: Architecture of SGCP (proposed method). SGCP aims to paraphrase an input sentence while conforming to the syntax of an exemplar sentence (provided along with the input). The input sentence is encoded using the Sentence Encoder to obtain a semantic signal c_t. The Syntactic Encoder takes a constituency parse tree of the exemplar sentence (pruned at height H) as input and produces representations for all the nodes in the pruned tree. Once both are encoded, the Syntactic Paraphrase Decoder uses a pointer-generator network: at each time step it takes the semantic signal c_t, the decoder's recurrent state s_t, the embedding of the previous token, and the syntactic signal h_Y^t to generate a new token. Note that the syntactic signal remains the same for every token in a span (indicated by the curly braces in the figure). The gray shaded region (not part of the model) illustrates a qualitative comparison between the exemplar's syntax tree and the syntax tree obtained from the generated paraphrase.
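To make the decoding step concrete, here is a minimal PyTorch-style sketch of it. This is an illustration only: the class name, tensor names, and dimensions are hypothetical and do not correspond to this repository's actual classes, and the copy mechanism of the pointer-generator network is omitted for brevity.

import torch
import torch.nn as nn

class SyntacticParaphraseDecoderStep(nn.Module):
    """Illustrative sketch of one decoding step; not the repository's actual code."""
    def __init__(self, emb_dim, hid_dim, vocab_size):
        super().__init__()
        # The recurrent cell consumes the previous token's embedding
        # concatenated with the semantic signal c_t.
        self.cell = nn.LSTMCell(emb_dim + hid_dim, hid_dim)
        # The output layer mixes the new recurrent state s_t, the semantic
        # signal c_t, and the syntactic signal h_Y of the current span.
        self.out = nn.Linear(3 * hid_dim, vocab_size)

    def forward(self, prev_token_emb, state, c_t, h_Y):
        s_t, mem = self.cell(torch.cat([prev_token_emb, c_t], dim=-1), state)
        logits = self.out(torch.cat([s_t, c_t, h_Y], dim=-1))
        return logits, (s_t, mem)

In the full pointer-generator decoder, the logits would additionally be interpolated with a copy distribution over the input sentence's tokens, and h_Y stays fixed for all tokens generated within one syntactic span.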

Dependencies

  • Compatible with PyTorch 1.3.0 and Python 3.x
  • The necessary packages can be installed through requirements.txt

Setup

To get the project's source code, clone the GitHub repository:

$ git clone https://github.com/malllabiisc/SGCP

Install virtualenv using the following (optional):

$ [sudo] pip install virtualenv

Create and activate your virtual environment (optional):

$ virtualenv -p python3 venv
$ source venv/bin/activate

Install all the required packages:

$ pip install -r requirements.txt

Create essential folders in the repository using:

$ chmod a+x setup.sh
$ ./setup.sh

Resources

Dataset

  • Download the following dataset(s): Data
  • Extract and place them in the SGCP/data directory

Path: SGCP/data/<dataset-folder-name>.

A sample dataset folder might look like this:

data/QQPPos/<train/test/val>/<src.txt/tgt.txt/refs.txt/src.txt-corenlp-opti/tgt.txt-corenlp-opti/refs.txt-corenlp-opti>

Pre-trained Models

  • Download the following pre-trained models for both QQPPos and ParaNMT50m datasets: Models
  • Extract and place them in the SGCP/Models directory

Path: SGCP/Models/<dataset_Models>

Evaluation Essentials

  • Download the evaluation file: evaluation
  • Extract and place it in the SGCP/src/evaluation directory
  • Give executable permissions to SGCP/src/evaluation/apps/multi-bleu.perl (see the command below)
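For example, from the repository root:

$ chmod a+x src/evaluation/apps/multi-bleu.perl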

Path: SGCP/src/evaluation/<apps/data/ParaphraseDetection>

This contains all the files needed to evaluate the model, including the Paraphrase Detection models used for model-based evaluation.

Training the model

  • For training the model with default hyperparameter settings, execute the following command (a worked example follows the flag descriptions):

    python -m src.main -mode train -run_name testrun -dataset <DatasetName> -gpu <GPU-ID> -bpe
    
    • -run_name: To specify the name of the run for storing model parameters
    • -dataset: Which dataset to train the model on, choose from QQPPos and ParaNMT50m
    • -gpu: For a multi-GPU machine, specify the ID of the GPU on which you wish to run the code. On a single-GPU machine, simply use 0 as the ID
    • -bpe: To enable byte-pair encoding for tokenizing data.
  • Other hyperparameters can be viewed in src/args.py
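For example, a default training run on the QQPPos dataset using GPU 0 with byte-pair encoding would be:

    python -m src.main -mode train -run_name testrun -dataset QQPPos -gpu 0 -bpe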

Generation and Evaluation

  • For generating paraphrases on the QQPPos dataset, execute the following command:

    python -m src.main -mode decode -dataset QQPPos -run_name QQP_Models -gpu <gpu-num> -beam_width 10 -max_length 60 -res_file generations.txt
    
  • Similarly, for the ParaNMT50m dataset:

    python -m src.main -mode decode -dataset ParaNMT50m -run_name ParaNMT_Models -gpu <gpu-num> -beam_width 10 -max_length 60 -res_file generations.txt
    
  • To evaluate BLEU, ROUGE, METEOR, TED and Prec. scores, first clean the generations:

    • For QQPPos:
    python -m src.utils.clean_generations -gen_dir Generations/QQP_Models -data_dir data/QQPPos/test -gen_file generations.txt

    • For ParaNMT50m:
    python -m src.utils.clean_generations -gen_dir Generations/ParaNMT_Models -data_dir data/ParaNMT50m/test -gen_file generations.txt
    
  • Since our model generates multiple paraphrases corresponding to different heights of the syntax tree, to select a single generation:

    python -m src.utils.candidate_selection -gen_dir Generations/QQP_Models -clean_gen_file clean_generations.csv -res_file final_paraphrases.txt -crt <SELECTION CRITERIA>
    
    • -crt: Criterion for selecting a single generation from the given candidates. Choose 'rouge' for ROUGE-based selection as described in the paper (SGCP-R), or 'maxht' to select the generation corresponding to the full height of the tree (SGCP-F). An end-to-end example is given at the end of this section.
  • Finally, to obtain the scores, run:

    • For QQPPos:
    python -m src.evaluation.eval -i Generations/QQP_Models/final_paraphrases.txt -r data/QQPPos/test/ref.txt -t data/QQPPos/test/tgt.txt

    • For ParaNMT50m:
    python -m src.evaluation.eval -i Generations/ParaNMT_Models/final_paraphrases.txt -r data/ParaNMT50m/test/ref.txt -t data/ParaNMT50m/test/tgt.txt
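  • Putting the above together, an end-to-end QQPPos run looks like the following (GPU 0 and ROUGE-based selection are example choices):

    python -m src.main -mode decode -dataset QQPPos -run_name QQP_Models -gpu 0 -beam_width 10 -max_length 60 -res_file generations.txt
    python -m src.utils.clean_generations -gen_dir Generations/QQP_Models -data_dir data/QQPPos/test -gen_file generations.txt
    python -m src.utils.candidate_selection -gen_dir Generations/QQP_Models -clean_gen_file clean_generations.csv -res_file final_paraphrases.txt -crt rouge
    python -m src.evaluation.eval -i Generations/QQP_Models/final_paraphrases.txt -r data/QQPPos/test/ref.txt -t data/QQPPos/test/tgt.txt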
    

Custom Dataset Processing

Preprocess and parse the data using the following steps.

  1. Move the contents of your custom dataset into the data/ directory, with files arranged like this:

    • data
      • Custom_Dataset
        • train
          • src.txt
          • tgt.txt
        • val
          • src.txt
          • tgt.txt
          • ref.txt
        • test
          • src.txt
          • tgt.txt
          • ref.txt

    Here, src.txt contains the source sentences, tgt.txt contains the exemplars, and ref.txt contains the reference paraphrases.

  2. Construct a byte-pair codes file, which will be used to generate byte-pair encodings of the dataset. From the main directory of this repo, run:

     subword-nmt learn-bpe < data/Custom_Dataset/train/src.txt > data/Custom_Dataset/train/codes.txt

     Note [optional]: to learn codes from both src.txt and tgt.txt, first concatenate the two files and replace src.txt in the command above with the name of the concatenated file (see the combined sketch after these steps).

  3. Parse the data files using Stanford CoreNLP. First start a CoreNLP server by executing the following commands:

cd src/evaluation/apps/stanford-corenlp-full-2018-10-05
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -preload tokenize,ssplit,pos,lemma,ner,parse -parse.model /edu/stanford/nlp/models/srparser/englishSR.ser.gz -status_port <PORT_NUMBER> -port <PORT_NUMBER> -timeout 15000
  4. Finally, run the parser on the text files:
cd <PATH_TO_THIS_REPO>
python -m src.utils.con_parser -infile data/Custom_Dataset/train/src.txt -codefile data/Custom_Dataset/train/codes.txt -port <PORT_NUMBER> -host localhost

Here, <PORT_NUMBER> is the port on which the CoreNLP server from step 3 is running.

This will generate a file called src.txt-corenlp-opti in the train folder. Run this for all the other files as well, i.e. tgt.txt in the train folder; src.txt, tgt.txt, and ref.txt in the val folder; and similarly for the files in the test folder. A combined sketch covering all the files follows.
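Putting steps 2-4 together, the sketch below learns BPE codes over both source and exemplar sentences and then parses every data file. The concatenated file name all.txt and port 9000 are illustrative choices (not repository defaults), and the loop assumes the CoreNLP server from step 3 is running on port 9000:

cat data/Custom_Dataset/train/src.txt data/Custom_Dataset/train/tgt.txt > data/Custom_Dataset/train/all.txt
subword-nmt learn-bpe < data/Custom_Dataset/train/all.txt > data/Custom_Dataset/train/codes.txt
for split in train val test; do
  for f in src tgt ref; do
    if [ -f "data/Custom_Dataset/$split/$f.txt" ]; then
      python -m src.utils.con_parser -infile "data/Custom_Dataset/$split/$f.txt" -codefile data/Custom_Dataset/train/codes.txt -port 9000 -host localhost
    fi
  done
done

The -f test skips files that the layout in step 1 does not include (e.g. ref.txt in the train folder).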

Citing

Please cite the following paper if you use this code in your work.

@article{sgcp2020,
author = {Kumar, Ashutosh and Ahuja, Kabir and Vadapalli, Raghuram and Talukdar, Partha},
title = {Syntax-Guided Controlled Generation of Paraphrases},
journal = {Transactions of the Association for Computational Linguistics},
volume = {8},
number = {},
pages = {330-345},
year = {2020},
doi = {10.1162/tacl\_a\_00318},
URL = { https://doi.org/10.1162/tacl_a_00318 },
eprint = { https://doi.org/10.1162/tacl_a_00318 }
}

For any clarification, comments, or suggestions, please create an issue or contact [email protected].
