
zlsh80826 / MSMARCO

Licence: other
Machine comprehension trained on MSMARCO with the S-NET extraction modification

Programming Languages

  • python
  • shell

Projects that are alternatives of or similar to MSMARCO

ODSQA
ODSQA: OPEN-DOMAIN SPOKEN QUESTION ANSWERING DATASET
Stars: ✭ 43 (+38.71%)
Mutual labels:  question-answering, machine-comprehension
TOEFL-QA
A question answering dataset for machine comprehension of spoken content
Stars: ✭ 61 (+96.77%)
Mutual labels:  question-answering, machine-comprehension
QA4IE
Original implementation of QA4IE
Stars: ✭ 24 (-22.58%)
Mutual labels:  question-answering
squadgym
Environment that can be used to evaluate reasoning capabilities of artificial agents
Stars: ✭ 27 (-12.9%)
Mutual labels:  question-answering
mrqa
Code for EMNLP-IJCNLP 2019 MRQA Workshop Paper: "Domain-agnostic Question-Answering with Adversarial Training"
Stars: ✭ 35 (+12.9%)
Mutual labels:  question-answering
denspi
Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index (DenSPI)
Stars: ✭ 188 (+506.45%)
Mutual labels:  question-answering
unanswerable qa
The official implementation for ACL 2021 "Challenges in Information Seeking QA: Unanswerable Questions and Paragraph Retrieval".
Stars: ✭ 21 (-32.26%)
Mutual labels:  question-answering
iPerceive
Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering | Python3 | PyTorch | CNNs | Causality | Reasoning | LSTMs | Transformers | Multi-Head Self Attention | Published in IEEE Winter Conference on Applications of Computer Vision (WACV) 2021
Stars: ✭ 52 (+67.74%)
Mutual labels:  question-answering
Stargraph
StarGraph (aka *graph) is a graph database to query large Knowledge Graphs. Playing with Knowledge Graphs can be useful if you are developing AI applications or doing data analysis over complex domains.
Stars: ✭ 24 (-22.58%)
Mutual labels:  question-answering
SQUAD2.Q-Augmented-Dataset
Augmented version of SQUAD 2.0 for Questions
Stars: ✭ 31 (+0%)
Mutual labels:  question-answering
WikiTableQuestions
A dataset of complex questions on semi-structured Wikipedia tables
Stars: ✭ 81 (+161.29%)
Mutual labels:  question-answering
PororoQA
PororoQA, https://arxiv.org/abs/1707.00836
Stars: ✭ 26 (-16.13%)
Mutual labels:  question-answering
PersianQA
Persian (Farsi) Question Answering Dataset (+ Models)
Stars: ✭ 114 (+267.74%)
Mutual labels:  question-answering
head-qa
HEAD-QA: A Healthcare Dataset for Complex Reasoning
Stars: ✭ 20 (-35.48%)
Mutual labels:  question-answering
DockerKeras
We provide GPU-enabled docker images including Keras, TensorFlow, CNTK, MXNET and Theano.
Stars: ✭ 49 (+58.06%)
Mutual labels:  cntk
CNTKUnityTools
Some Deep learning tools in Unity using CNTK
Stars: ✭ 21 (-32.26%)
Mutual labels:  cntk
GAR
Code and resources for papers "Generation-Augmented Retrieval for Open-Domain Question Answering" and "Reader-Guided Passage Reranking for Open-Domain Question Answering", ACL 2021
Stars: ✭ 38 (+22.58%)
Mutual labels:  question-answering
deformer
[ACL 2020] DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering
Stars: ✭ 111 (+258.06%)
Mutual labels:  question-answering
KrantikariQA
An InformationGain based Question Answering over knowledge Graph system.
Stars: ✭ 54 (+74.19%)
Mutual labels:  question-answering
cherche
📑 Neural Search
Stars: ✭ 196 (+532.26%)
Mutual labels:  question-answering

MSMARCO with S-NET Extraction (Extraction-net)

Requirements

The following libraries and tools are required for training and evaluation.

General

  • python3.6
  • cuda-9.0 (required by CNTK)
  • openmpi-1.10 (required by CNTK)
  • gcc >= 6 (required by CNTK)

Python

  • Please refer to requirements.txt; the listed packages can be installed with pip.

Evaluate with the pretrained model

This repo provides a pretrained model and a pre-processed validation dataset for testing performance.

Please download the pre-processed data and the pretrained model, put them in the MSMARCO/data directory and the MSMARCO root directory respectively, then decompress them in place.

The resulting directory structure should look like this:

MSMARCO
├── data
│   ├── elmo_embedding.bin
│   ├── test.tsv
│   ├── vocabs.pkl
│   ├── data.tar.gz
│   └── ... others
├── model
│   ├── pm.model
│   ├── pm.model.ckp
│   └── pm.model_out.json
└── ... others
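
If you prefer to decompress from Python rather than with tar, here is a minimal sketch. data.tar.gz is the file name shown in the tree above; the pretrained-model archive name is not stated in this README, so model.tar.gz below is a placeholder for whichever file you downloaded.

import tarfile
from pathlib import Path

# Decompress the pre-processed data archive inside MSMARCO/data
# (data.tar.gz is the name shown in the tree above).
with tarfile.open(Path("data") / "data.tar.gz") as tar:
    tar.extractall(path="data")

# "model.tar.gz" is a placeholder name for the pretrained-model archive.
with tarfile.open("model.tar.gz") as tar:
    tar.extractall(path=".")  # should yield the model/ directory shown above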

After decompressing, run

cd Evaluation
sh eval.sh

and you should get the generated answers and the ROUGE-L score.
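
The reported ROUGE-L is the LCS-based F-measure. For reference, here is a minimal, self-contained sketch of the metric; the repo's eval.sh computes it with its own evaluation code, not this snippet, and beta=1.2 is the weight used in the COCO-caption ROUGE-L implementation that the MS MARCO scripts derive from.

def lcs_len(a, b):
    # Length of the longest common subsequence of two token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    # LCS-based ROUGE-L F-measure between two whitespace-tokenized strings.
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

print(rouge_l("the cat sat on the mat", "the cat is on the mat"))  # ~0.833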

Usage

Preprocess

MSMARCO V1

Download the MSMARCO v1 dataset and the GloVe embeddings.

cd data
python3.6 download.py v1

Convert raw data to tsv format

python3.6 convert_msmarco.py v1 --threads=`nproc` 
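
convert_msmarco.py parallelizes this conversion across workers. As a rough sketch of the idea, assuming the JSON-lines layout of the v1 release (one record per line with query, passages, and answers fields); the tsv column layout here is illustrative, not the repo's exact schema:

import json
from multiprocessing import Pool, cpu_count

def to_tsv_row(line):
    # One MS MARCO record per line: a query, candidate passages, and answers.
    rec = json.loads(line)
    passages = " ".join(p["passage_text"].replace("\t", " ") for p in rec["passages"])
    answer = (rec.get("answers") or [""])[0].replace("\t", " ")
    return "\t".join([str(rec["query_id"]), rec["query"], passages, answer])

if __name__ == "__main__":
    with open("train_v1.1.json") as fin, open("train.tsv", "w") as fout:
        with Pool(cpu_count()) as pool:  # cpu_count() plays the role of `nproc`
            for row in pool.imap(to_tsv_row, fin, chunksize=256):
                fout.write(row + "\n")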

Convert the tsv files to CTF (CNTK text input) format and build the vocabulary dictionary

python3.6 tsv2ctf.py
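
CTF is CNTK's line-based text format: each line carries a sequence id followed by |stream value entries, and lines sharing an id form one sequence. A minimal sketch of this step, reusing the illustrative tsv schema from the conversion sketch above (stream names are assumptions, not necessarily the repo's):

import pickle
from collections import defaultdict
from itertools import count

vocab = defaultdict(count().__next__)  # word -> integer id, assigned on first use

with open("train.tsv") as fin, open("train.ctf", "w") as fout:
    for seq_id, line in enumerate(fin):
        _, query, passage, _ = line.rstrip("\n").split("\t")
        for word in query.split():      # one sparse one-hot sample per line
            fout.write(f"{seq_id} |query {vocab[word]}:1\n")
        for word in passage.split():
            fout.write(f"{seq_id} |passage {vocab[word]}:1\n")

with open("vocabs.pkl", "wb") as f:     # cf. vocabs.pkl in the tree above
    pickle.dump(dict(vocab), f)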

Generate the ELMo embeddings

sh elmo.sh
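
The internals of elmo.sh are not shown here. One common way to produce per-word ELMo vectors is allennlp's ElmoEmbedder, used below purely as an illustration; the output file name matches elmo_embedding.bin from the tree above, while the layer-averaging scheme is an assumption.

import pickle
import numpy as np
from allennlp.commands.elmo import ElmoEmbedder  # pip install allennlp

with open("vocabs.pkl", "rb") as f:
    vocab = pickle.load(f)
words = sorted(vocab, key=vocab.get)       # vocabulary words in id order

elmo = ElmoEmbedder()                      # downloads default pretrained weights
# For a large vocabulary you would batch this instead of one giant call.
layers = elmo.embed_sentence(words)        # shape: (3, len(words), 1024)
vectors = np.asarray(layers).mean(axis=0)  # average ELMo's three layers

vectors.astype(np.float32).tofile("elmo_embedding.bin")  # cf. tree above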

MSMARCO V2

Download the MSMARCO v2 dataset and the GloVe embeddings.

cd data
python3.6 download.py v2

Convert raw data to tsv format

python3.6 convert_msmarco.py v2 --threads=`nproc`

Convert the tsv files to CTF (CNTK text input) format and build the vocabulary dictionary

python3.6 tsv2ctf.py

Generate the ELMo embeddings

sh elmo.sh

Train (Same for V1 and V2)

cd ../script
mkdir log
sh run.sh
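
run.sh drives the repo's training script. As a rough orientation only, this is how a CTF file like the one sketched above feeds CNTK's reader; it is not the S-NET model itself, and the vocabulary size is a placeholder.

import cntk as C

VOCAB = 50000  # placeholder; the real size comes from vocabs.pkl

# Bind the CTF stream names from the preprocessing sketch to CNTK streams.
streams = C.io.StreamDefs(
    query=C.io.StreamDef(field="query", shape=VOCAB, is_sparse=True),
    passage=C.io.StreamDef(field="passage", shape=VOCAB, is_sparse=True),
)
source = C.io.MinibatchSource(C.io.CTFDeserializer("train.ctf", streams),
                              randomize=True, max_sweeps=1)

# Streams of differing length would each get their own dynamic axis in a real
# model; a single query input is enough to show the plumbing.
query = C.sequence.input_variable(VOCAB, is_sparse=True, name="query")
mb = source.next_minibatch(64, input_map={query: source.streams.query})
print(mb[query].num_sequences, "sequences in the first minibatch")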

Evaluate on the dev dataset

MSMARCO V1

cd Evaluation
sh eval.sh v1

MSMARCO V2

cd Evaluation
sh eval.sh v2

Performance

Paper

                               rouge-l   bleu_1
S-Net (Extraction)             41.45     44.08
S-Net (Extraction, Ensemble)   42.92     44.97

This implementation

                               rouge-l   bleu_1
MSMARCO v1 w/o elmo            38.43     39.14
MSMARCO v1 w/ elmo             39.42     39.47
MSMARCO v2 w/ elmo             43.66     44.44

TODO

  • Multi-threaded preprocessing
  • ELMo embedding
  • Evaluation script
  • MSMARCO v2 support
  • Reasonable metrics