google-research-datasets / Wiki Split

Projects that are alternatives to or similar to Wiki Split

Nlp Pretrained Model
A collection of Natural language processing pre-trained models.
Stars: ✭ 122 (+28.42%)
Mutual labels:  deep-neural-networks, nlp-machine-learning
Dab
Data Augmentation by Backtranslation (DAB) ヽ( •_-)ᕗ
Stars: ✭ 294 (+209.47%)
Mutual labels:  deep-neural-networks, nlp-machine-learning
Character Based Cnn
Implementation of a character-based convolutional neural network
Stars: ✭ 205 (+115.79%)
Mutual labels:  deep-neural-networks, nlp-machine-learning
Deeppavlov
An open source library for deep learning end-to-end dialog systems and chatbots.
Stars: ✭ 5,525 (+5715.79%)
Mutual labels:  deep-neural-networks, nlp-machine-learning
Hierarchical Attention Networks Pytorch
Hierarchical Attention Networks for document classification
Stars: ✭ 239 (+151.58%)
Mutual labels:  deep-neural-networks, nlp-machine-learning
Letslearnai.github.io
Lets Learn AI
Stars: ✭ 33 (-65.26%)
Mutual labels:  deep-neural-networks, nlp-machine-learning
Deepicf
TensorFlow Implementation of Deep Item-based Collaborative Filtering Model for Top-N Recommendation
Stars: ✭ 86 (-9.47%)
Mutual labels:  deep-neural-networks
Lda Topic Modeling
A PureScript, browser-based implementation of LDA topic modeling.
Stars: ✭ 91 (-4.21%)
Mutual labels:  nlp-machine-learning
Niftynet
[unmaintained] An open-source convolutional neural networks platform for research in medical image analysis and image-guided therapy
Stars: ✭ 1,276 (+1243.16%)
Mutual labels:  deep-neural-networks
Summarus
Models for automatic abstractive summarization
Stars: ✭ 83 (-12.63%)
Mutual labels:  nlp-machine-learning
360sd Net
Pytorch implementation of ICRA 2020 paper "360° Stereo Depth Estimation with Learnable Cost Volume"
Stars: ✭ 94 (-1.05%)
Mutual labels:  deep-neural-networks
Ngraph
nGraph has moved to OpenVINO
Stars: ✭ 1,322 (+1291.58%)
Mutual labels:  deep-neural-networks
Bert As Service
Mapping a variable-length sentence to a fixed-length vector using BERT model
Stars: ✭ 9,779 (+10193.68%)
Mutual labels:  deep-neural-networks
Facial Expression Recognition
💡 My solution to facial emotion recognition in a Kaggle competition
Stars: ✭ 88 (-7.37%)
Mutual labels:  deep-neural-networks
Doc2vec
📓 Long(er) text representation and classification using Doc2Vec embeddings
Stars: ✭ 92 (-3.16%)
Mutual labels:  nlp-machine-learning
Text classification
Text Classification Algorithms: A Survey
Stars: ✭ 1,276 (+1243.16%)
Mutual labels:  nlp-machine-learning
Linq To Wiki
.Net library to access MediaWiki API
Stars: ✭ 93 (-2.11%)
Mutual labels:  wikipedia
Breast Cancer Classification
Breast Cancer Classification using CNN and transfer learning
Stars: ✭ 86 (-9.47%)
Mutual labels:  deep-neural-networks
Mediawiki
MediaWiki API wrapper in python http://pymediawiki.readthedocs.io/en/latest/
Stars: ✭ 89 (-6.32%)
Mutual labels:  wikipedia
Learn Ml Basics
A collection of resources that should help and guide your first steps as you learn ML and DL. I am a beginner as well, and these are the resources I found most useful.
Stars: ✭ 93 (-2.11%)
Mutual labels:  deep-neural-networks

WikiSplit Dataset

One million English sentences, each split into two sentences that together preserve the original meaning, extracted from Wikipedia edits.

http://goo.gl/language/wiki-split

Update (3 June 2019): The source code for the evaluations in our paper has been released in a separate repository: https://github.com/google-research/google-research/tree/master/wiki_split_bleu_eval.

Description

Google's WikiSplit dataset was constructed automatically from the publicly available Wikipedia revision history. Although the dataset contains some inherent noise, it can serve as valuable training data for models that split or merge sentences.

For further details about the construction of the dataset and its use for model training, see the accompanying paper: Learning to Split and Rephrase From Wikipedia Edit History.

If you use or discuss this dataset in your work, please cite our paper:

@InProceedings{BothaEtAl2018,
  title = {{Learning To Split and Rephrase From Wikipedia Edit History}},
  author = {Botha, Jan A and Faruqui, Manaal and Alex, John and Baldridge, Jason and Das, Dipanjan},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  pages = {to appear},
  note = {arXiv preprint arXiv:1808.09468},
  year = {2018}
}

Examples

  • Due to the hurricane , Lobsterfest has been canceled , making Bob very happy about it and he decides to open Bob 's Burgers for customers who were planning on going to Lobsterfest .

    • Due to the hurricane , Lobsterfest has been canceled , making Bob ecstatic .
    • He decides to open Bob 's Burgers for customers who were planning on going to Lobsterfest .
  • Her family is rumored to be a large financial clique which controls the underworld of Japan , but rarely people know the unhappiness which she suffered for being born in such a troublesome family .

    • Her family is rumored to be a large financial clique which controls the underworld of Japan .
    • People are unaware of the unhappiness which she suffered for being born in such a troublesome family .

Data format

The dataset is released as text files formatted as tab-separated values (TSV) according to the following schema:

Column   Meaning
1        unsplit single sentence
2        split-up sentences, delimited by the string <::::>

The sentences come pre-tokenized, with punctuation split off as separate tokens (note the spacing in the example below).

Example data item

Due to the hurricane , Lobsterfest has been canceled , making Bob very happy about it and he decides to open Bob 's Burgers for customers who were planning on going to Lobsterfest .	Due to the hurricane , Lobsterfest has been canceled , making Bob ecstatic . <::::> He decides to open Bob 's Burgers for customers who were planning on going to Lobsterfest .
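
For reference, here is a minimal Python sketch for loading files that follow this schema; the read_wikisplit helper name is ours and purely illustrative:

import csv

def read_wikisplit(path):
    """Yield (unsplit sentence, list of split sentences) pairs from one TSV file."""
    with open(path, encoding="utf-8") as f:
        # QUOTE_NONE treats any quote characters in the text as ordinary characters.
        for unsplit, split in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            # Column 2 joins the split sentences with the <::::> delimiter.
            yield unsplit, [s.strip() for s in split.split("<::::>")]

for unsplit, splits in read_wikisplit("train.tsv"):
    print(unsplit)
    print(splits)
    break  # show just the first instance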

Dataset statistics

Part            Instances   Tokens*      Vocabulary*
train.tsv         989,944   33,084,465       632,588
tune.tsv            5,000      167,456        25,871
validation.tsv      5,000      166,628        25,251
test.tsv            5,000      167,853        25,386

*counted over the unsplit sentences
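
Because the text is pre-tokenized, the counts above should be approximately reproducible by whitespace splitting alone; a sketch under that assumption:

import csv

def corpus_stats(path):
    """Count instances, tokens, and vocabulary over the unsplit (column 1) sentences."""
    instances, tokens, vocab = 0, 0, set()
    with open(path, encoding="utf-8") as f:
        for unsplit, _ in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            words = unsplit.split()  # pre-tokenized text, so whitespace splitting suffices
            instances += 1
            tokens += len(words)
            vocab.update(words)
    return instances, tokens, len(vocab)

print(corpus_stats("validation.tsv"))  # should land near (5000, 166628, 25251) from the table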

Result on WebSplit 1.0 Benchmark

Our paper introducing the WikiSplit dataset applied it to the split-and-rephrase task. The main result is that including WikiSplit during model training leads to improved generalization and dramatically better output on the WebSplit 1.0 test set.

Model                                                   Corpus BLEU
Source (i.e. just echoing the input sentence)                  58.7
Trained on WebSplit only (Aharoni & Goldberg, 2017)            30.5
Trained on WebSplit + WikiSplit (Botha et al., 2018)*          62.4

*See the paper for details.

To allow direct comparison with this work, the evaluation source code is available at: https://github.com/google-research/google-research/tree/master/wiki_split_bleu_eval.
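
As a rough, unofficial sanity check of model output, corpus BLEU can also be computed with a standard toolkit such as NLTK. The sketch below assumes one reference per hypothesis and whitespace-tokenized strings; for numbers directly comparable to the table above, use the released wiki_split_bleu_eval code instead:

from nltk.translate.bleu_score import corpus_bleu

def rough_corpus_bleu(hypotheses, references):
    """Corpus BLEU on a 0-100 scale, with a single reference per hypothesis."""
    hyps = [h.split() for h in hypotheses]
    refs = [[r.split()] for r in references]  # corpus_bleu expects a list of reference lists
    return 100 * corpus_bleu(refs, hyps)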

License

The WikiSplit dataset is a verbatim copy of certain content from the publicly available Wikipedia revision history. The dataset is therefore licensed under CC BY-SA 4.0. Any third party content or data is provided "As Is" without any warranty, express or implied.

Contact

If you have a technical question regarding the dataset or publication, please create an issue in this repository.
