COVID-19 Infodemic Twitter Dataset

This repository contains a dataset of tweets annotated with fine-grained labels related to disinformation about COVID-19. The labels answer seven different questions that are of interest to journalists, fact-checkers, social media platforms, policymakers, and society as a whole. Annotations are available for Arabic and English.

To label the dataset, we prepared comprehensive annotation guidelines [1], which can also support similar tasks in other domains. Moreover, we launched an annotation platform for labeling tweets, where anyone can contribute and help increase the size of the dataset. We will update the dataset here periodically.

Table of contents:

Help the community to label more data
Questions with Labels
List of Versions
Contents of the Distribution
Download
Publication
Credits
Licensing
Contact
Acknowledgment

Help the community to label more data

We invite you to join us in labeling tweets related to COVID-19 disinformation.

To annotate, we recommend that you register with Micromappers and then log in. However, you can also annotate without any registration.

Please go to either of the following links:

English
Arabic
Then click either Start Contributing Now or Contribute. This leads to a page with annotation instructions. Scroll down and click Start contributing.

You can now start annotating.

An example of the annotation page looks as follows:

Questions with Labels

Below is the list of questions and their possible labels (answers). See the paper below or the Micromappers links above for the detailed annotation guidelines.

1. Does the tweet contain a verifiable factual claim?
Labels:

YES: if it contains a verifiable factual claim;
NO: if it does not contain a verifiable factual claim;
Don’t know or can’t judge: the tweet does not contain enough information to make a judgment. This label is recommended when the tweet is not understandable at all, for example because it uses a language (e.g., not English) or references that are difficult to understand;

2. To what extent does the tweet appear to contain false information?
Labels:

NO, definitely contains no false information
NO, probably contains no false information
Not sure
YES, probably contains false information
YES, definitely contains false information

3. Will the tweet’s claim have an effect on or be of interest to the general public?
Labels:

NO, definitely not of interest
NO, probably not of interest
Not sure
YES, probably of interest
YES, definitely of interest

4. To what extent does the tweet appear to be harmful to society, person(s), company(s) or product(s)?
Labels:

NO, definitely not harmful
NO, probably not harmful
Not sure
YES, probably harmful
YES, definitely harmful

5. Do you think that a professional fact-checker should verify the claim in the tweet?
Labels:

NO, no need to check
NO, too trivial to check
YES, not urgent
YES, very urgent
Not sure

6. Is the tweet harmful for society and why?
Labels:

NO, not harmful
NO, joke or sarcasm
Not sure
YES, panic
YES, xenophobic, racist, prejudices, or hate-speech
YES, bad cure
YES, rumor or conspiracy
YES, other

7. Do you think that this tweet should get the attention of a government entity?
Labels:

NO, not interesting
Not sure
YES, categorized as in question 6
YES, other
YES, blame authorities
YES, contains advice
YES, calls for action
YES, discusses action taken
YES, discusses cure
YES, asks question

List of Versions

===================
v1.0 [2020/05/01]: initial distribution of the annotated dataset

English data: 504 tweets
Arabic data: 218 tweets

Contents of the Distribution

===============================================

Directory Structure

=======================

The distribution contains this file and the following two sub-directories:

Readme.txt: this file

"English": This directory contains one tab-separated values (TSV) file and one JSON file. The TSV file stores the ground-truth annotations for the questions listed above; its format is described in detail below. Each line in the JSON file holds the data of a single tweet in JSON format (as downloaded from Twitter).
"Arabic": Like the English directory, this directory contains one TSV file and one JSON file in the same format.

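Because each line of the JSON file holds one tweet, the file can be read line by line. The following is a minimal Python sketch, not part of the distribution. It assumes that each tweet object carries Twitter's usual `id_str` (or `id`) field; the file name in the usage comment is hypothetical.

```python
import json


def iter_tweets(lines):
    """Yield one parsed tweet object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def index_by_id(lines):
    """Map each tweet id (as a string) to its full tweet object."""
    index = {}
    for tweet in iter_tweets(lines):
        # Assumption: the id is stored under "id_str", or under "id" as a fallback.
        tweet_id = str(tweet.get("id_str", tweet.get("id")))
        index[tweet_id] = tweet
    return index


# Usage with a real file (the file name is hypothetical):
# with open("English/tweets.json", encoding="utf-8") as handle:
#     tweets = index_by_id(handle)
```

The resulting dictionary can be matched against the tweet_id column of the TSV file.
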
Format of the TSV files under the "annotations" directory

Each TSV file in this directory contains the following columns, separated by a tab:

tweet_id: corresponds to the actual tweet id from Twitter.
tweet_text: corresponds to the original text of a given tweet as downloaded from Twitter.
q*_label (columns 3-9): correspond to the labels for questions 1 to 7.

Note that the TSV files contain NA (i.e., null) entries, which indicate "not applicable" cases. Questions 2 to 5 are labeled NA when question 1 is labeled NO.
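
The column layout and the NA rule described above can be checked with a few lines of Python. This is a sketch under stated assumptions: the exact column names (`q1_label` to `q7_label`) are inferred from the `q*_label` pattern, the files are assumed to have a header row, and the Q1 value is compared case-insensitively because its exact spelling in the files is not specified here.

```python
import csv

# Assumed column names; the README only gives the pattern "q*_label".
QUESTION_COLUMNS = [f"q{i}_label" for i in range(1, 8)]
DEPENDENT_ON_Q1 = QUESTION_COLUMNS[1:5]  # q2_label .. q5_label


def read_annotations(lines):
    """Return a list of row dicts from tab-separated annotation lines."""
    # QUOTE_NONE keeps quotation marks inside tweet text as literal characters.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def violates_na_rule(row):
    """True if Q1 is 'no' but any of Q2 to Q5 carries a real label."""
    if row["q1_label"].strip().lower() != "no":
        return False
    return any(row[col].strip().upper() != "NA" for col in DEPENDENT_ON_Q1)


# Usage with a real file (the file name is hypothetical):
# with open("English/annotations.tsv", encoding="utf-8", newline="") as handle:
#     rows = read_annotations(handle)
#     bad_rows = [row["tweet_id"] for row in rows if violates_na_rule(row)]
```
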

Examples

============

Please don't take hydroxychloroquine (Plaquenil) plus Azithromycin for #COVID19 UNLESS your doctor prescribes it. Both drugs affect the QT interval of your heart and can lead to arrhythmias and sudden death, especially if you are taking other meds or have a heart condition.
Labels:

Q1: Yes;
Q2: NO: probably contains no false info
Q3: YES: definitely of interest
Q4: NO: probably not harmful
Q5: YES:very-urgent
Q6: NO:not-harmful
Q7: YES:discusses_cure

BREAKING: @MBuhari’s Chief Of Staff, Abba Kyari, Reportedly Sick, Suspected Of Contracting #Coronavirus | Sahara Reporters A top government source told SR on Monday that Kyari has been seriously “down” since returning from a trip abroad. READ MORE: https://t.co/Acy5NcbMzQ https://t.co/kStp4cmFlr.
Labels:

Q1: Yes;
Q2: NO: probably contains no false info
Q3: YES: definitely of interest
Q4: NO: definitely not harmful
Q5: YES:not-urgent
Q6: YES:rumor
Q7: YES:classified_as_in_question_6

Statistics

=============
Some statistics about the dataset

English tweets:

Q1 = 504 labeled tweets

no 209
yes 295

Q2 = 295 labeled tweets

1_no_definitely_contains_no_false_info 47
2_no_probably_contains_no_false_info 171
3_not_sure 40
4_yes_probably_contains_false_info 25
5_yes_definitely_contains_false_info 12

Q3 = 295 labeled tweets

1_no_definitely_not_of_interest 9
2_no_probably_not_of_interest 44
3_not_sure 7
4_yes_probably_of_interest 177
5_yes_definitely_of_interest 58

Q4 = 295 labeled tweets

1_no_definitely_not_harmful 106
2_no_probably_not_harmful 66
3_not_sure 2
4_yes_probably_harmful 67
5_yes_definitely_harmful 54

Q5 = 295 labeled tweets

no_no_need_to_check 77
no_too_trivial_to_check 57
yes_not_urgent 112
yes_very_urgent 49

Q6 = 504 labeled tweets

no_joke_or_sarcasm 62
no_not_harmful 333
not_sure 2
yes_bad_cure 3
yes_other 25
yes_panic 23
yes_rumor_conspiracy 42
yes_xenophobic_racist_prejudices_or_hate_speech 14

Q7 = 504 labeled tweets

no_not_interesting 319
not_sure 6
yes_asks_question 2
yes_blame_authorities 81
yes_calls_for_action 8
yes_classified_as_in_question_6 34
yes_contains_advice 9
yes_discusses_action_taken 12
yes_discusses_cure 5
yes_other 28

Arabic tweets:

Q1 = 218 labeled tweets

no 78
yes 140

Q2 = 140 labeled tweets

1_no_definitely_contains_no_false_info 31
2_no_probably_contains_no_false_info 62
3_not_sure 5
4_yes_probably_contains_false_info 40
5_yes_definitely_contains_false_info 2

Q3 = 140 labeled tweets

1_no_definitely_not_of_interest 1
2_no_probably_not_of_interest 5
3_not_sure 9
4_yes_probably_of_interest 76
5_yes_definitely_of_interest 49

Q4 = 140 labeled tweets

1_no_definitely_not_harmful 68
2_no_probably_not_harmful 21
3_not_sure 3
4_yes_probably_harmful 46
5_yes_definitely_harmful 2

Q5 = 140 labeled tweets

no_no_need_to_check 22
no_too_trivial_to_check 55
yes_not_urgent 48
yes_very_urgent 15

Q6 = 218 labeled tweets

no_joke_or_sarcasm 2
no_not_harmful 159
yes_bad_cure 1
yes_other 5
yes_panic 12
yes_rumor_conspiracy 33
yes_xenophobic_racist_prejudices_or_hate_speech 6

Q7 = 218 labeled tweets

no_not_interesting 163
yes_blame_authorities 13
yes_calls_for_action 1
yes_classified_as_in_question_6 30
yes_contains_advice 1
yes_discusses_cure 6
yes_other 4
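
In the listing above, the totals for Q2 to Q5 (295 English, 140 Arabic) equal the number of tweets labeled yes for Q1, since the remaining tweets are NA for those questions. Counts of this kind can be recomputed from the annotation rows with a short Python sketch. It assumes that the rows are dictionaries keyed by column name, that the column names follow the `q*_label` pattern, and that the label strings in the files match the ones shown in the listing.

```python
from collections import Counter


def label_counts(rows, column):
    """Count the labels in one column, skipping NA (not applicable) entries."""
    return Counter(row[column] for row in rows if row[column] != "NA")


def ordinal_value(label):
    """Return the leading 1-5 rank of a Q2-Q4 label such as '3_not_sure'.

    Labels without a numeric prefix (Q1, Q5, Q6, Q7) return None.
    """
    prefix = label.split("_", 1)[0]
    return int(prefix) if prefix.isdigit() else None


# Example rows; in practice these would come from the TSV file.
rows = [
    {"q1_label": "yes", "q2_label": "3_not_sure"},
    {"q1_label": "no", "q2_label": "NA"},
    {"q1_label": "yes", "q2_label": "2_no_probably_contains_no_false_info"},
]
q1 = label_counts(rows, "q1_label")
q2 = label_counts(rows, "q2_label")
# The number of labeled Q2 tweets equals the number of Q1 "yes" tweets.
print(sum(q2.values()) == q1["yes"])
```

The numeric prefix makes it possible to treat the five-point scales of Q2 to Q4 as ordinal values, for example when averaging or thresholding labels.
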

Download

To download the dataset, please fill out this form.

Publications:

Please cite the following papers if you use the data or the annotation guidelines:

Firoj Alam, Fahim Dalvi, Shaden Shaar, Nadir Durrani, Hamdy Mubarak, Alex Nikolov, Giovanni Da San Martino, Ahmed Abdelali, Hassan Sajjad, Kareem Darwish, Preslav Nakov, "Fighting the COVID-19 Infodemic in Social Media: A Holistic Perspective and a Call to Arms", Proceedings of the International AAAI Conference on Web and Social Media, Vol. 15, pp. 913-922, 2021.
Firoj Alam, Shaden Shaar, Fahim Dalvi, Hassan Sajjad, Alex Nikolov, Hamdy Mubarak, Giovanni Da San Martino, Ahmed Abdelali, Nadir Durrani, Kareem Darwish, Abdulaziz Al-Homaid, Wajdi Zaghouani, Tommaso Caselli, Gijs Danoe, Friso Stolk, Britt Bruntink, Preslav Nakov, "Fighting the COVID-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society", Findings of EMNLP 2021.

@InProceedings{alam2020call2arms,
  title		= {Fighting the {COVID}-19 Infodemic in Social Media: A
		  Holistic Perspective and a Call to Arms},
  author	= {Alam, Firoj and Dalvi, Fahim and Shaar, Shaden and
		  Durrani, Nadir and Mubarak, Hamdy and Nikolov, Alex and {Da
		  San Martino}, Giovanni and Abdelali, Ahmed and Sajjad,
		  Hassan and Darwish, Kareem and Nakov, Preslav},
  year		= {2021},
  pages		= {913-922},
  month	= {May},
  volume	= {15},
  booktitle	= {Proceedings of the International {AAAI} Conference on Web
		  and Social Media},
  series	= {ICWSM~'21},
  url		= {https://ojs.aaai.org/index.php/ICWSM/article/view/18114}
}
@inproceedings{alam2020fighting,
    title={Fighting the {COVID}-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society},
    author={Firoj Alam and Shaden Shaar and Fahim Dalvi and Hassan Sajjad and Alex Nikolov and Hamdy Mubarak and Giovanni Da San Martino and Ahmed Abdelali and Nadir Durrani and Kareem Darwish and Abdulaziz Al-Homaid and Wajdi Zaghouani and Tommaso Caselli and Gijs Danoe and Friso Stolk and Britt Bruntink and Preslav Nakov},
    booktitle = {Findings of EMNLP 2021},
    year={2021},
}

Credits

Firoj Alam, Qatar Computing Research Institute, HBKU
Shaden Shaar, Qatar Computing Research Institute, HBKU
Alex Nikolov, Sofia University
Hamdy Mubarak, Qatar Computing Research Institute, HBKU
Giovanni Da San Martino, Qatar Computing Research Institute, HBKU
Ahmed Abdelali, Qatar Computing Research Institute, HBKU
Fahim Dalvi, Qatar Computing Research Institute, HBKU
Nadir Durrani, Qatar Computing Research Institute, HBKU
Hassan Sajjad, Qatar Computing Research Institute, HBKU
Kareem Darwish, Qatar Computing Research Institute, HBKU
Preslav Nakov, Qatar Computing Research Institute, HBKU

Licensing

This dataset is free for general research use.

Contact

Please contact [email protected]

Acknowledgment

Thanks to QCRI's Crisis Computing team for providing us with access to Micromappers.
