lanwuwei / Twitter-URL-Corpus

Large-scale sentential paraphrase collection and annotation

Note

The download link is not working right now; please email [email protected] for data access.

News

This repository currently contains a 3-month raw data sample, and our 1-year URL data is now available: 2,869,657 candidate pairs. Please check our paraphrase website to download the dataset.

Paraphrase-dataset

This repository contains the code and data used in the following paper; please cite it if you use them in your research:

@inproceedings{lan2017continuously,
  author     = {Lan, Wuwei and Qiu, Siyu and He, Hua and Xu, Wei},
  title      = {A Continuously Growing Dataset of Sentential Paraphrases},
  booktitle  = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year       = {2017},
  publisher  = {Association for Computational Linguistics},
  pages      = {1235--1245},
  location   = {Copenhagen, Denmark},
  url        = {http://aclweb.org/anthology/D17-1127}
}

A few notes

  1. Put your own Twitter keys into config.py (see the sketch after this list) and modify line 59 in main.py before running the code.
  2. The training and testing files are subsets of the raw data with human annotation. Both files have the same format; each line contains: sentence1 \tab sentence2 \tab (n,6) \tab url.
  3. Each sentence pair was annotated by 6 Amazon Mechanical Turk workers, where 1 represents paraphrase and 0 represents non-paraphrase, so in total n out of 6 workers judged the pair a paraphrase. If n<=2, we treat the pair as non-paraphrase; if n>=4, we treat it as paraphrase; if n==3, we discard it (see the loading sketch after this list).
  4. After discarding pairs with n==3, we get 42,200 pairs for training and 9,334 for testing.
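For note 1, a config.py holding Twitter API credentials usually looks something like the sketch below; the variable names here are purely illustrative and may differ from what the repository's main.py actually imports.

# Hypothetical config.py layout -- the variable names below are illustrative
# only and may differ from what main.py actually imports.
CONSUMER_KEY = "your-consumer-key"
CONSUMER_SECRET = "your-consumer-secret"
ACCESS_TOKEN = "your-access-token"
ACCESS_TOKEN_SECRET = "your-access-token-secret"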
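For notes 2-4, here is a minimal Python sketch that loads an annotated file and maps the (n,6) vote counts to binary labels using the thresholds above. The file name and the exact textual rendering of the vote field are assumptions, not taken from the repository.

# Minimal sketch: load annotated sentence pairs and map (n,6) votes to labels.
# Assumes tab-separated lines "sentence1<TAB>sentence2<TAB>(n,6)<TAB>url";
# the vote-field format and the file name below are assumptions.
def load_pairs(path):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            sent1, sent2, votes, url = line.rstrip("\n").split("\t")
            n = int(votes.strip("() ").split(",")[0])  # e.g. "(4, 6)" -> 4
            if n >= 4:
                pairs.append((sent1, sent2, 1, url))   # paraphrase
            elif n <= 2:
                pairs.append((sent1, sent2, 0, url))   # non-paraphrase
            # n == 3 is ambiguous and discarded, per note 3 above
    return pairs

train_pairs = load_pairs("train.data")  # hypothetical file name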

License

The dataset is released for non-commercial use under the CC BY-NC-SA 3.0 license. Use of the data must abide by the Twitter Terms of Service and Developer Policy.
