ppasupat / WikiTableQuestions
A dataset of complex questions on semi-structured Wikipedia tables.
License: CC-BY-SA 4.0


WikiTableQuestions Dataset

Version 1.0.2 (October 4, 2016)

Introduction

The WikiTableQuestions dataset is for the task of question answering on semi-structured HTML tables as presented in the paper:

Panupong Pasupat, Percy Liang.
Compositional Semantic Parsing on Semi-Structured Tables
Association for Computational Linguistics (ACL), 2015.

More details about the project: https://nlp.stanford.edu/software/sempre/wikitable/

TSV Format

Many files in this dataset are stored as tab-separated values (TSV) with the following special constructs:

  • List items are separated by | (e.g., when|was|taylor|swift|born|?).

  • The following characters are escaped: newline (=> \n), backslash (\ => \\), and pipe (| => \p). Note that pipes become \p so that x.split('|') works correctly (see the unescaping sketch after this list).

  • Consecutive whitespaces (except newlines) are collapsed into a single space.
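A minimal Python sketch (ours, not part of the dataset's tooling; the helper name unescape_field is hypothetical) of one way to undo these escapes when reading a TSV field:

    def unescape_field(field):
        """Split a WikiTableQuestions TSV field into list items and undo the escapes."""
        items = field.split('|')          # list items are separated by |
        out = []
        for item in items:
            decoded, i = [], 0
            while i < len(item):
                # Decode \n, \p, and \\; leave any other sequence untouched.
                if item[i] == '\\' and i + 1 < len(item):
                    nxt = item[i + 1]
                    decoded.append({'n': '\n', 'p': '|', '\\': '\\'}.get(nxt, item[i:i + 2]))
                    i += 2
                else:
                    decoded.append(item[i])
                    i += 1
            out.append(''.join(decoded))
        return out

For example, unescape_field('when|was|taylor|swift|born|?') returns ['when', 'was', 'taylor', 'swift', 'born', '?'].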

Questions and Answers

The data/ directory contains the questions, answers, and the IDs of the tables that the questions are asking about.

Each portion of the dataset is stored as a TSV file where each line contains one example.

Field descriptions:

  • id: unique ID of the example
  • utterance: the question in its original format
  • context: the table used to answer the question
  • targetValue: the answer, possibly a |-separated list
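A minimal loading sketch (ours, not part of the distribution; it assumes the first line of each TSV file is a header row naming the four fields above, and that the file name follows the split names below):

    import csv

    # Load one split of the question/answer data.
    # Assumption: the first line is a header row (id, utterance, context, targetValue).
    with open('data/training.tsv', newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        examples = list(reader)

    ex = examples[0]
    answers = ex['targetValue'].split('|')   # targetValue may be a |-separated list
    print(ex['id'], ex['utterance'], answers)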

Dataset Splits: We split 22033 examples into multiple sets:

  • training: Training data (14152 examples)

  • pristine-unseen-tables: Test data -- the tables are not seen in training data (4344 examples)

  • pristine-seen-tables: Additional data where the tables are seen in the training data (3537 examples). Initially intended as development data, this portion of the dataset has not been used in any experiment in the paper.

  • random-split-*: For development, we split training.tsv into random 80-20 splits. Within each split, tables in the training data (random-split-seed-*-train) and the test data (random-split-seed-*-test) are disjoint.

  • training-before300: The first 300 training examples.

  • annotated-all.examples: The first 300 training examples annotated with gold logical forms.

For our ACL 2015 paper:

  • In development set experiments, we trained on random-split-seed-{1,2,3}-train and tested on random-split-seed-{1,2,3}-test, respectively.

  • In test set experiments, we trained on training and tested on pristine-unseen-tables.

Supplementary Files:

  • *.examples files: LispTree-format versions of the dataset used internally in our SEMPRE code base. These files contain the same information as the TSV files.

Tables

The csv/ directory contains the extracted tables, while the page/ directory contains the raw HTML data of the whole web page.

Table Formats:

  • csv/xxx-csv/yyy.csv: Comma-separated table. The first row is treated as the column header. The escaped characters are double quote (" => \") and backslash (\ => \\); newlines are represented as quoted line breaks. (A reading sketch follows this section.)

  • csv/xxx-csv/yyy.tsv: Tab-separated table. The TSV escapes explained at the beginning are used.

  • csv/xxx-csv/yyy.table: Human-readable column-aligned table. Some information was lost during data conversion, so this format should not be used as an input.

  • csv/xxx-csv/yyy.html: Formatted HTML of just the table

  • page/xxx-page/yyy.html: Raw HTML of the whole web page

  • page/xxx-page/yyy.json: Metadata including the URL, the page title, and the index of the chosen table. (Only tables with the wikitable class are considered.)

The conversion from HTML to CSV and TSV was done using table-to-tsv.py; the required dependencies are in the weblib/ directory.
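A reading sketch for the CSV variant (ours, not part of the release; the path below uses the xxx/yyy placeholders above), configuring Python's csv module for backslash escapes instead of doubled quotes:

    import csv

    # Read one extracted table in CSV form.
    # The format escapes " as \" and \ as \\, so disable doubled-quote
    # handling and set a backslash escape character; quoted line breaks
    # inside cells are handled by opening the file with newline=''.
    with open('csv/xxx-csv/yyy.csv', newline='', encoding='utf-8') as f:
        reader = csv.reader(f, doublequote=False, escapechar='\\')
        rows = list(reader)

    header, body = rows[0], rows[1:]   # the first row is the column header
    print(header, len(body))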

CoreNLP Tagged Files

Questions and tables are tagged using CoreNLP 3.5.2. The annotation is not perfect (e.g., it cannot detect the date "13-12-1989"), but it is usually good enough.

  • tagged/data/*.tagged: Tagged questions. Each line contains one example.

    Field descriptions:

    • id: unique ID of the example
    • utterance: the question in its original format
    • context: the table used to answer the question
    • targetValue: the answer, possibly a |-separated list
    • tokens: the question, tokenized
    • lemmaTokens: the question, tokenized and lemmatized
    • posTags: the part of speech tag of each token
    • nerTags: the named entity tag of each token
    • nerValues: if the NER tag is numerical or temporal, the value of that NER span will be listed here
    • targetCanon: canonical form of the answers where numbers and dates are converted into normalized values
    • targetCanonType: type of the canonical answers; possible values include "number", "date", "string", and "mixed"
  • tagged/xxx-tagged/yyy.tagged: Tab-separated file containing the CoreNLP annotation of each table cell. Each line represents one table cell. (A reading sketch follows this list.)

    Mandatory fields:

    • row: row index (-1 is the header row)
    • col: column index
    • id: unique ID of the cell.
      • Each header cell gets a unique ID even when the contents are identical
      • Non-header cells get the same ID if they have exactly the same content
    • content: the cell text (images and hidden spans are removed)
    • tokens: the cell text, tokenized
    • lemmaTokens: the cell text, tokenized and lemmatized
    • posTags: the part of speech tag of each token
    • nerTags: the named entity tag of each token
    • nerValues: if the NER tag is numerical or temporal, the value of that NER span will be listed here

    The following fields are optional:

    • number: interpretation as a number (for multiple numbers, the first number is extracted)
    • date: interpretation as a date
    • num2: the second number in the cell (useful for scores like 1-2)
    • list: interpretation as a list of items

    Header cells do not have these optional fields.
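A reading sketch for the per-table .tagged files (ours, not part of the release; the path uses the xxx/yyy placeholders above, and we assume the first line of the file is a header row naming the fields):

    import csv

    # Load the CoreNLP annotation of every cell of one table.
    # Assumption: the first line is a header row (row, col, id, content, tokens, ...).
    with open('tagged/xxx-tagged/yyy.tagged', newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        cells = list(reader)

    # Token-level fields are |-separated lists (see the TSV escapes above).
    header_cells = [c for c in cells if c['row'] == '-1']
    for c in header_cells:
        print(c['col'], c['content'], c['tokens'].split('|'))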

Evaluator

evaluator.py is the official evaluator.

Usage: evaluator.py <tagged_dataset_path> <prediction_path>

  • tagged_dataset_path should be a dataset .tagged file containing the relevant examples

  • prediction_path should contain the model's predictions. Each line should contain ex_id item1 item2 ... If the model does not produce a prediction for an example, output just the ex_id with no items. (See the example below.)

Note that the resulting scores will differ from what SEMPRE produces: SEMPRE additionally enforces that the prediction has the same type as the target value, while the official evaluator is more lenient.
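For illustration only (the example IDs and file names below are placeholders, not taken from the dataset), a prediction file (call it predictions.txt) might look like this, where the last line is an example for which the model produced no prediction:

    ex-1	Greece
    ex-2	4	7
    ex-3

A run on the unseen-tables test split could then look like the following (assuming the tagged file for that split is named after it):

    python evaluator.py tagged/data/pristine-unseen-tables.tagged predictions.txt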

Version History

1.0 - Fixed various bugs in datasets (encoding issues, number normalization issues)

0.5 - Added evaluator

0.4 - Added annotated logical forms of the first 300 examples / Renamed CoreNLP tagged data as tagged to avoid confusion

0.3 - Repaired table headers / Added raw HTML tables / Added CoreNLP tagged data

0.2 - Initial release

For questions and comments, please contact Ice Pasupat [email protected]
