Yale-LILY / dart

Licence: MIT license

DART: Open-Domain Structured Data Record to Text Generation

DART is a large, open-domain structured DAta Record to Text generation corpus with high-quality sentence annotations. It consists of 82191 examples across different domains. Each input is a semantic triple set (a set of entity-relation triples following a tree-structured ontology) derived from data records in tables and the tree ontology of the table schema, and is annotated with a sentence description that covers all facts in the triple set.

DART was introduced in the following paper, where you can find more details and baseline results.

Citation

@article{nan2021dart,
  title={DART: Open-Domain Structured Data Record to Text Generation},
  author={Linyong Nan and Dragomir Radev and Rui Zhang and Amrit Rau and Abhinand Sivaprasad and Chiachun Hsieh and Xiangru Tang and Aadit Vyas and Neha Verma and Pranav Krishna and Yangxiaokang Liu and Nadia Irwanto and Jessica Pan and Faiaz Rahman and Ahmad Zaidi and Murori Mutuma and Yasin Tarabar and Ankit Gupta and Tao Yu and Yi Chern Tan and Xi Victoria Lin and Caiming Xiong and Richard Socher and Nazneen Fatema Rajani},
  journal={arXiv preprint arXiv:2007.02871},
  year={2021}
}

Data Content and Format

The DART dataset is available in the data/v1.1.1/ directory. The dataset consists of a JSON version and an XML version of the train/dev/test files in data/.

Each JSON file contains a list of tripleset-annotation pairs of the form:

  {
    "tripleset": [
      [
        "Ben Mauk",
        "High school",
        "Kenton"
      ],
      [
        "Ben Mauk",
        "College",
        "Wake Forest Cincinnati"
      ]
    ],
    "subtree_was_extended": false,
    "annotations": [
      {
        "source": "WikiTableQuestions_lily",
        "text": "Ben Mauk, who attended Kenton High School, attended Wake Forest Cincinnati for college."
      }
    ]
  }
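
As a minimal sketch of reading this format, the snippet below parses a list of such records and yields one (triples, source, text) item per annotation. The inline sample simply repeats the record shown above, and the helper name iter_pairs is hypothetical, not part of the DART repository. For a real file, replace json.loads(sample) with json.load on an open file handle.

```python
import json

# Inline sample repeating the record above; a real DART JSON file
# holds a list of many such records.
sample = """
[
  {
    "tripleset": [
      ["Ben Mauk", "High school", "Kenton"],
      ["Ben Mauk", "College", "Wake Forest Cincinnati"]
    ],
    "subtree_was_extended": false,
    "annotations": [
      {
        "source": "WikiTableQuestions_lily",
        "text": "Ben Mauk, who attended Kenton High School, attended Wake Forest Cincinnati for college."
      }
    ]
  }
]
"""


def iter_pairs(records):
    """Yield (triples, source, text) for every annotation of every record.

    Hypothetical helper for illustration only.
    """
    for record in records:
        # Each triple is a [subject, relation, object] list in the JSON.
        triples = [tuple(triple) for triple in record["tripleset"]]
        for annotation in record["annotations"]:
            yield triples, annotation["source"], annotation["text"]


records = json.loads(sample)
pairs = list(iter_pairs(records))

for triples, source, text in pairs:
    print(source)
    for subject, relation, obj in triples:
        print(f"  {subject} | {relation} | {obj}")
    print(f"  -> {text}")
```

Flattening to one item per annotation is a design choice for this sketch: a record may carry several annotations, and most training pipelines want one target sentence per example.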

Each XML file contains a list of tripleset-lex pairs of the form:

  <entry category="MISC" eid="Id1" size="2">
    <modifiedtripleset>
      <mtriple>Mars Hill College | JOINED | 1973</mtriple>
      <mtriple>Mars Hill College | LOCATION | Mars Hill, North Carolina</mtriple>
    </modifiedtripleset>
    <lex comment="WikiSQL_decl_sents" lid="Id1">A school from Mars Hill, North Carolina, joined in 1973.</lex>
  </entry>
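
A rough sketch of reading an entry like this with Python's standard xml.etree.ElementTree follows. The <entries> wrapper element and the helper name parse_entry are assumptions made so the snippet is self-contained; the actual root element of the DART XML files may differ. The sketch also assumes that " | " never appears inside a subject, relation, or object.

```python
import xml.etree.ElementTree as ET

# The <entries> wrapper is an assumption for this sketch; only the
# <entry> element itself is taken from the example above.
sample = """
<entries>
  <entry category="MISC" eid="Id1" size="2">
    <modifiedtripleset>
      <mtriple>Mars Hill College | JOINED | 1973</mtriple>
      <mtriple>Mars Hill College | LOCATION | Mars Hill, North Carolina</mtriple>
    </modifiedtripleset>
    <lex comment="WikiSQL_decl_sents" lid="Id1">A school from Mars Hill, North Carolina, joined in 1973.</lex>
  </entry>
</entries>
"""


def parse_entry(entry):
    """Convert one <entry> element into a plain dictionary.

    Hypothetical helper for illustration only.
    """
    # Each <mtriple> holds "subject | relation | object" as text.
    triples = [tuple(m.text.split(" | ")) for m in entry.iter("mtriple")]
    # Each <lex> holds one sentence; its comment attribute names the source.
    lexes = [(lex.get("comment"), lex.text) for lex in entry.iter("lex")]
    return {
        "eid": entry.get("eid"),
        "category": entry.get("category"),
        "triples": triples,
        "lexes": lexes,
    }


root = ET.fromstring(sample.strip())
entries = [parse_entry(entry) for entry in root.iter("entry")]

for parsed in entries:
    print(parsed["eid"], parsed["category"])
    for triple in parsed["triples"]:
        print("  ", triple)
    for source, text in parsed["lexes"]:
        print(f"  [{source}] {text}")
```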

You can use data/v1.1.1/select_partitions.py to generate a dataset that contains different partitions of DART. Note that different partitions have different sources of annotation. Specifically, we have the following sources of annotation:

  • WikiTableQuestions_lily, WikiSQL_lily ⇒ Instances that are manually annotated by internal annotators
  • WikiTableQuestions_mturk ⇒ Instances that are manually annotated by MTurk workers
  • WikiSQL_decl_sents ⇒ Instances that are automatically annotated by a procedure described in Sec 2.2 of our paper
  • webnlg, e2e ⇒ Instances obtained by converting existing datasets; these partitions are less open-domain

In addition, we provide 4 settings for generating a dataset for your research purposes:

  • manual: this setting includes all manually annotated instances
  • manual_and_auto: this setting includes both manually and automatically annotated instances, but excludes webnlg and e2e, which are less open-domain partitions
  • full: this setting includes all partitions of DART
  • custom: you can choose any combination of partitions
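
To illustrate how the four settings relate to the annotation sources listed above, here is a minimal sketch that filters JSON-style records by the "source" field of their annotations. It is not the logic of select_partitions.py, whose interface is not described here. The function name select, the custom_sources parameter, and the choice to drop records left with no annotations are all assumptions.

```python
# Source names are taken from the list of annotation sources above.
MANUAL = {"WikiTableQuestions_lily", "WikiSQL_lily", "WikiTableQuestions_mturk"}
AUTO = {"WikiSQL_decl_sents"}
CONVERTED = {"webnlg", "e2e"}

SETTINGS = {
    "manual": MANUAL,
    "manual_and_auto": MANUAL | AUTO,
    "full": MANUAL | AUTO | CONVERTED,
}


def select(records, setting, custom_sources=None):
    """Keep only annotations whose source belongs to the chosen setting.

    Hypothetical helper: records with no remaining annotation are dropped,
    which is an assumption of this sketch.
    """
    if setting == "custom":
        if not custom_sources:
            raise ValueError("the custom setting needs at least one source")
        allowed = set(custom_sources)
    else:
        allowed = SETTINGS[setting]

    selected = []
    for record in records:
        kept = [a for a in record["annotations"] if a["source"] in allowed]
        if kept:
            # Copy the record so the input list is left unchanged.
            selected.append({**record, "annotations": kept})
    return selected


# Tiny made-up records, for demonstration only.
demo = [
    {"tripleset": [["a", "r", "b"]],
     "annotations": [{"source": "WikiSQL_lily", "text": "manual"}]},
    {"tripleset": [["c", "r", "d"]],
     "annotations": [{"source": "WikiSQL_decl_sents", "text": "auto"}]},
    {"tripleset": [["e", "r", "f"]],
     "annotations": [{"source": "webnlg", "text": "converted"}]},
]

for name in ("manual", "manual_and_auto", "full"):
    print(name, len(select(demo, name)))
print("custom", len(select(demo, "custom", custom_sources=["e2e", "webnlg"])))
```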

Models

We also provide the implementations we used to produce the results in our paper. Please refer to model/ for more information.

Leaderboard

We maintain a leaderboard on our test set.

Model                                                  BLEU   METEOR  TER   MoverScore  BERTScore  BLEURT  PARENT
T5-large (Raffel et al., 2020)                         50.66  0.40    0.43  0.54        0.95       0.44    0.58
BART-large (Lewis et al., 2020)                        48.56  0.39    0.45  0.52        0.95       0.41    0.57
Seq2Seq-Att (MELBOURNE)                                29.66  0.27    0.63  0.31        0.90       -0.13   0.35
End-to-End Transformer (Castro Ferreira et al., 2019)  27.24  0.25    0.65  0.25        0.89       -0.29   0.28