Georgetown-IR-Lab / Extendedsumm

On Generating Extended Summaries of Long Documents
Projects that are alternatives of or similar to Extendedsumm

Producttitlesummarizationcorpus
Dataset for CIKM 2018 paper "Multi-Source Pointer Network for Product Title Summarization"
Stars: ✭ 61 (-3.17%)
Mutual labels:  dataset, text-summarization
Dream
DREAM: A Challenge Dataset and Models for Dialogue-Based Reading Comprehension
Stars: ✭ 60 (-4.76%)
Mutual labels:  dataset
Knyfe
knyfe is a python utility for rapid exploration of datasets.
Stars: ✭ 54 (-14.29%)
Mutual labels:  dataset
View Finding Network
A deep ranking network that learns to find good compositions in a photograph.
Stars: ✭ 57 (-9.52%)
Mutual labels:  dataset
Quandl Python
Stars: ✭ 1,076 (+1607.94%)
Mutual labels:  dataset
Geodata Br
Free open public domain geographic data of Brazil available in multiple languages and formats.
Stars: ✭ 57 (-9.52%)
Mutual labels:  dataset
Covid 19
Novel Coronavirus 2019 time series data on cases
Stars: ✭ 1,060 (+1582.54%)
Mutual labels:  dataset
Wikipedia ner
📖 Labeled examples from wiki dumps in Python
Stars: ✭ 61 (-3.17%)
Mutual labels:  dataset
Maskrcnn Modanet
A Mask R-CNN Keras implementation with Modanet annotations on the Paperdoll dataset
Stars: ✭ 59 (-6.35%)
Mutual labels:  dataset
City Scapes Script
Download City Scapes Dataset using this script
Stars: ✭ 57 (-9.52%)
Mutual labels:  dataset
Cinemanet
Stars: ✭ 57 (-9.52%)
Mutual labels:  dataset
Clothing Detection Dataset
Clothing detection dataset
Stars: ✭ 55 (-12.7%)
Mutual labels:  dataset
Stevens Vlp16 Dataset
This dataset is captured using a Velodyne VLP-16, which is mounted on an UGV - Clearpath Jackal, on Stevens Institute of Technology campus
Stars: ✭ 58 (-7.94%)
Mutual labels:  dataset
Coarij
Corpus of Annual Reports in Japan
Stars: ✭ 55 (-12.7%)
Mutual labels:  dataset
Pysgs
📈 Python interface for the Brazilian Central Bank's Time Series Management System (SGS)
Stars: ✭ 60 (-4.76%)
Mutual labels:  dataset
Codar
✅ CODAR is a Framework built using PyTorch to analyze post (Text+Media) and predict Cyber Bullying and offensive content. 💬📷
Stars: ✭ 52 (-17.46%)
Mutual labels:  dataset
Animegan
A simple PyTorch Implementation of Generative Adversarial Networks, focusing on anime face drawing.
Stars: ✭ 1,095 (+1638.1%)
Mutual labels:  dataset
Legislator
Interface to the Comparative Legislators Database
Stars: ✭ 62 (-1.59%)
Mutual labels:  dataset
Covidnet Ct
COVID-Net Open Source Initiative - Models and Data for COVID-19 Detection in Chest CT
Stars: ✭ 57 (-9.52%)
Mutual labels:  dataset
Char Rnn Tensorflow
Multi-layer Recurrent Neural Networks for character-level language models implements by TensorFlow
Stars: ✭ 58 (-7.94%)
Mutual labels:  dataset

ExtendedSumm

This repository contains the implementation details and datasets used in the paper On Generating Extended Summaries of Long Documents, presented at the AAAI-21 Workshop on Scientific Document Understanding (SDU 2021).

Conda environment: preliminary setup

To install the required packages, create the conda environment from the yml file in the root directory using the following command:

conda env create -f environment.yml
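
Once created, activate the environment before running any of the commands below. The environment name is defined in environment.yml; extendedsumm here is an assumption, so substitute the name field from that file:

# activate the environment created from environment.yml (the name is an assumption)
conda activate extendedsumm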

How to run...

IMPORTANT: The following commands should be run under the src/ directory.
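
For example, starting from the repository root:

cd src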

Dataset

To start with, you first need to download the datasets that are intended to work with the code base. You can download them from the following links:

Dataset     | Download Link
arXiv-Long  | Download
PubMed-Long | Download

After downloading the dataset, you will need to uncompress it using the following command:

tar -xvf pubmedL.tar.gz 

This will uncompress the pubmedL tar file into the current directory. The directory will include the individual JSON files of the different sets: training, validation, and test.

FORMAT: Each paper file is structured as a JSON object with the following keys (see the inspection sketch after this list):

  • "id" (String): the paper ID
  • "abstract" (String): the abstract text of the paper. This field is different from "gold" field for the datasets that have different ground-truth than the abstract.
  • "gold" (List <List<>>): the ground-truth summary of the paper, where the inner list is the tokens associated with each gold summary sentence.
  • "sentences" (List <List<>>): the source sentences of the full-text. The inner list contains 5 indices, each of which represents different fields of the source sentence:
    • Index [0]: tokens of the sentences (i.e., list of tokens).
    • Index [1]: textual representation of the section that the sentence belongs to.
    • Index [2]: Rouge-L score of the sentence with the gold summary.
    • Index [3]: textual representation of the sentences.
    • Index [4]: oracle label associated with the sentence (0, or 1).
    • Index [5]: the section id assigned by sequential sentence classification package. For more information, please refer to this repository
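
To sanity-check this format, you can inspect one of the extracted files from the shell. This is a minimal sketch assuming jq is installed and that paper.json is one of the extracted per-paper JSON files (the file name is a placeholder):

# print the paper ID and the number of source sentences
jq '{id: .id, num_sentences: (.sentences | length)}' paper.json

# for the first source sentence, print its section title (index [1]),
# its ROUGE-L score (index [2]), and its oracle label (index [4])
jq '.sentences[0][1], .sentences[0][2], .sentences[0][4]' paper.json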

Preparing Data

Simply run the prep.sh bash script, providing the dataset directory. The script uses two functions: it first creates aggregated JSON files, and then prepares them for use with pretrained language models.
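
A minimal usage sketch (assuming prep.sh takes the dataset directory as a positional argument; if it instead reads the path from a variable at the top of the script, edit it there):

# run from src/, pointing at the directory with the extracted JSON files
bash prep.sh /path/to/dataset/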

Please note that if you want to use a custom dataset and create torch files, you will need to convert your dataset to the format given in the Dataset section.

Training

The full training scripts are inside the train.sh bash file. To run it on your machine, you will need to change the directories to fit your needs:

...

DATA_PATH=/path/to/dataset/torch-files/
MODEL_PATH=/path/to/saved/model/

# Specify GPUs: either a single GPU or multiple GPUs
export CUDA_VISIBLE_DEVICES=0,1


# You don't need to modify the lines below
LOG_DIR=../logs/$(echo $MODEL_PATH | cut -d \/ -f 6).log
mkdir -p ../results/$(echo $MODEL_PATH | cut -d \/ -f 6)
RESULT_PATH_TEST=../results/$(echo $MODEL_PATH | cut -d \/ -f 6)/

MAX_POS=2500

...
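
Once the variables are set, launch training from src/:

bash train.sh

Note that the cut -d / -f 6 calls above derive the model name from the sixth component of MODEL_PATH, so if your path has a different depth you will need to adjust the field number.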

Inference

The inference scripts are inside the test.sh bash file. To run it on your machine, you will need to modify the file directories:

...
# path to the data directory
BERT_DIR=/path/to/dataset/torch-files/

# path to the trained model directory
MODEL_PATH=/disk1/sajad/sci-trained-models/presum/LSUM-2500-segmented-sectioned-multi50-classi-v1/

# path to the best trained model (or the checkpoint that you want to run inference on)
CHECKPOINT=$MODEL_PATH/Recall_BEST_model_s63000_0.4910.pt

# GPUs: either a single GPU or multiple GPUs
export CUDA_VISIBLE_DEVICES=0,1

MAX_POS=2500

...
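
As with training, run the script from src/ once the paths point at your torch files and checkpoint:

bash test.sh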

Citation

If you plan to use this work, please cite the following papers:

@inproceedings{Sotudeh2021ExtendedSumm,
  title={On Generating Extended Summaries of Long Documents},
  author={Sajad Sotudeh and Arman Cohan and Nazli Goharian},
  booktitle={The AAAI-21 Workshop on Scientific Document Understanding (SDU 2021)},
  year={2021}
}
@inproceedings{Sotudeh2020LongSumm,
  title={GUIR @ LongSumm 2020: Learning to Generate Long Summaries from Scientific Documents},
  author={Sajad Sotudeh and Arman Cohan and Nazli Goharian},
  booktitle={First Workshop on Scholarly Document Processing (SDP 2020)},
  year={2020}
}