Query-wellformedness Dataset

25,100 queries from the Paralex corpus (Fader et al., 2013) annotated with human ratings of whether they are well-formed natural language questions.

http://goo.gl/language/query-wellformedness

Description

Google's query wellformedness dataset was created by crowdsourcing well-formedness annotations for 25,100 queries from the Paralex corpus. Each query was annotated by five raters, each of whom gave a binary (1/0) rating of whether the query is well-formed. For further details, please read our paper: Identifying Well-formed Natural Language Questions.

For each query, we provide the average of the five binary judgements as its wellformedness score. The following are some examples of queries in the dataset:

Query | Wellformedness rating
--- | ---
Which form of government is still in place in greece ? | 1.0
Population of owls just in north america ? | 0.0
Is johnny depp a celtic fan ? | 0.8
Where did Roald Dahl live in his teenaged years ? | 0.6
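
The score described above is just the mean of five 0/1 judgements, so it can only take the values 0.0, 0.2, 0.4, 0.6, 0.8 and 1.0. The following is a minimal sketch of that averaging. The function name and the individual ratings are hypothetical, since the dataset files contain only the averaged score and not the per-rater judgements.

```python
def wellformedness_score(ratings):
    """Average five binary (0/1) rater judgements into a single score."""
    if len(ratings) != 5 or any(r not in (0, 1) for r in ratings):
        raise ValueError("expected exactly five ratings, each 0 or 1")
    return sum(ratings) / len(ratings)


# Hypothetical per-rater judgements (not part of the released data):
# four raters say "well-formed", one says "not well-formed".
example_ratings = [1, 1, 1, 1, 0]
print(wellformedness_score(example_ratings))
```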

The dataset is divided into three files, train.tsv, dev.tsv and test.tsv, each containing rated queries. The file sizes are as follows:

File | No. of queries
--- | ---
train.tsv | 17,500
dev.tsv | 3,750
test.tsv | 3,850

Examples

Each line in a file is one example with two tab-separated columns:

Column | Content | Example
--- | --- | ---
1 | Query text | The European Union includes how many ?
2 | Wellformedness rating | 0.2
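
As an illustration of the two-column layout described above, here is a minimal sketch that parses tab-separated lines into (query, rating) pairs using Python's standard csv module. The function name is hypothetical, and the sketch assumes the files have no header row. It reads from an in-memory sample built from queries quoted in this README rather than from the real files.

```python
import csv
import io


def read_rated_queries(handle):
    """Yield (query, rating) pairs from an open tab-separated file."""
    # QUOTE_NONE keeps any quote characters inside a query as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        query, rating = row
        yield query, float(rating)


# In-memory stand-in for a file such as train.tsv.
sample = io.StringIO(
    "The European Union includes how many ?\t0.2\n"
    "Is johnny depp a celtic fan ?\t0.8\n"
)
rows = list(read_rated_queries(sample))
print(rows)
```

To read a real file, the same function could be called on a handle opened with `open("train.tsv", newline="", encoding="utf-8")`, assuming the file sits in the working directory.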

Reference

If you use or discuss this dataset in your work, please cite our paper:

@InProceedings{FaruquiDas2018,
  title = {{Identifying Well-formed Natural Language Questions}},
  author = {Faruqui, Manaal and Das, Dipanjan},
  booktitle = {Proc. of EMNLP},
  year = {2018}
}

License

The Query-wellformedness dataset is licensed under CC BY-SA 4.0. Any third party content or data is provided “As Is” without any warranty, express or implied.

Contact

If you have a technical question regarding the dataset or publication, please create an issue in this repository.
