
nguyenvo09 / EMNLP2020

License: MIT
This is the official PyTorch code and datasets for the paper "Where Are the Facts? Searching for Fact-checked Information to Alleviate the Spread of Fake News", EMNLP 2020.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to EMNLP2020

Ranking
Learning to Rank in TensorFlow
Stars: ✭ 2,362 (+4194.55%)
Mutual labels:  information-retrieval, learning-to-rank
src
tools for fast reading of docs
Stars: ✭ 40 (-27.27%)
Mutual labels:  information-retrieval, learning-to-rank
gpl
Powerful unsupervised domain adaptation method for dense retrieval. Requires only unlabeled corpus and yields massive improvement: "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval" https://arxiv.org/abs/2112.07577
Stars: ✭ 216 (+292.73%)
Mutual labels:  information-retrieval
captain-fact
📚 Documentation, wiki and community discussions
Stars: ✭ 59 (+7.27%)
Mutual labels:  fact-checking
LuceneTutorial
A simple tutorial of Lucene for LIS 501 Introduction to Text Mining students at the University of Wisconsin-Madison (Fall 2021).
Stars: ✭ 62 (+12.73%)
Mutual labels:  information-retrieval
ImageRetrieval
Content Based Image Retrieval Techniques (e.g. knn, svm using MatLab GUI)
Stars: ✭ 51 (-7.27%)
Mutual labels:  information-retrieval
awesome-pretrained-models-for-information-retrieval
A curated list of awesome papers related to pre-trained models for information retrieval (a.k.a., pretraining for IR).
Stars: ✭ 278 (+405.45%)
Mutual labels:  information-retrieval
pqlite
⚡ A fast embedded library for approximate nearest neighbor search
Stars: ✭ 141 (+156.36%)
Mutual labels:  information-retrieval
naacl2018-fever
Fact Extraction and VERification baseline published in NAACL2018
Stars: ✭ 109 (+98.18%)
Mutual labels:  information-retrieval
solr
Apache Solr open-source search software
Stars: ✭ 651 (+1083.64%)
Mutual labels:  information-retrieval
rust-stemmers
A rust implementation of some popular snowball stemming algorithms
Stars: ✭ 85 (+54.55%)
Mutual labels:  information-retrieval
SENet-for-Weakly-Supervised-Relation-Extraction
No description or website provided.
Stars: ✭ 39 (-29.09%)
Mutual labels:  information-retrieval
WWW2021
Official repository to release the code and datasets in the paper "Mining Dual Emotion for Fake News Detection", WWW 2021.
Stars: ✭ 45 (-18.18%)
Mutual labels:  fake-news-detection
netizenship
a commandline #OSINT tool to find the online presence of a username in popular social media websites like Facebook, Instagram, Twitter, etc.
Stars: ✭ 33 (-40%)
Mutual labels:  information-retrieval
query-wellformedness
25,100 queries from the Paralex corpus (Fader et al., 2013) annotated with human ratings of whether they are well-formed natural language questions.
Stars: ✭ 80 (+45.45%)
Mutual labels:  information-retrieval
ProQA
Progressively Pretrained Dense Corpus Index for Open-Domain QA and Information Retrieval
Stars: ✭ 44 (-20%)
Mutual labels:  information-retrieval
patzilla
PatZilla is a modular patent information research platform and data integration toolkit with a modern user interface and access to multiple data sources.
Stars: ✭ 71 (+29.09%)
Mutual labels:  information-retrieval
ConvDR
Code repo for SIGIR 2021 paper "Few-Shot Conversational Dense Retrieval"
Stars: ✭ 36 (-34.55%)
Mutual labels:  information-retrieval
beir
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
Stars: ✭ 738 (+1241.82%)
Mutual labels:  information-retrieval
BERT-QE
Code and resources for the paper "BERT-QE: Contextualized Query Expansion for Document Re-ranking".
Stars: ✭ 43 (-21.82%)
Mutual labels:  information-retrieval

EMNLP2020

This repository contains the code and data to reproduce the results in the paper "Where Are the Facts? Searching for Fact-checked Information to Alleviate the Spread of Fake News", EMNLP 2020.

Multimodal Attention Network

(Figure: overview of the Multimodal Attention Network architecture.)

Datasets

Snopes

PolitiFact

Images data

Structure of dataset folders

After downloading and extracting the data, the expected structure of formatted_data and images_data is as follows:

EMNLP2020/
├── formatted_data
│   ├── Politifact
│   │   ├── 50_candidates_bm25_extended_reranking
│   │   │   ├── Politifact.dev.tsv
│   │   │   ├── Politifact.test.tsv
│   │   │   ├── Politifact.test2_hard.tsv
│   │   │   └── Politifact.train.tsv
│   │   ├── 50_candidates_bm25_extended_reranking_and_text_in_img
│   │   │   ├── Politifact.dev.tsv
│   │   │   ├── Politifact.test.tsv
│   │   │   ├── Politifact.test2_hard.tsv
│   │   │   └── Politifact.train.tsv
│   │   ├── 50_candidates_bm25_extended_reranking_and_text_in_img_avoid_bias
│   │   │   ├── Politifact.dev.tsv
│   │   │   ├── Politifact.test.tsv
│   │   │   ├── Politifact.test2_hard.tsv
│   │   │   └── Politifact.train.tsv
│   │   ├── article_mapped.json
│   │   ├── articles_content.json
│   │   ├── elmo_features_avoid_bias
│   │   │   ├── articles_feats.pth
│   │   │   └── queries_feats.pth
│   │   ├── elmo_features_only_text_in_tweets
│   │   │   ├── articles_feats.pth
│   │   │   └── queries_feats.pth
│   │   ├── elmo_features_use_text_in_img
│   │   │   ├── articles_feats.pth
│   │   │   └── queries_feats.pth
│   │   ├── queries_content.json
│   │   ├── query.negatives
│   │   ├── query_article_interaction.csv
│   │   └── query_mapped.json
│   └── Snopes
│       ├── 50_candidates_bm25_extended_reranking
│       │   ├── Snopes.dev.tsv
│       │   ├── Snopes.test.tsv
│       │   ├── Snopes.test2_hard.tsv
│       │   └── Snopes.train.tsv
│       ├── 50_candidates_bm25_extended_reranking_and_text_in_img
│       │   ├── Snopes.dev.tsv
│       │   ├── Snopes.test.tsv
│       │   ├── Snopes.test2_hard.tsv
│       │   └── Snopes.train.tsv
│       ├── 50_candidates_bm25_extended_reranking_and_text_in_img_avoid_bias
│       │   ├── Snopes.dev.tsv
│       │   ├── Snopes.test.tsv
│       │   ├── Snopes.test2_hard.tsv
│       │   └── Snopes.train.tsv
│       ├── article_mapped.json
│       ├── articles_content.json
│       ├── elmo_features_avoid_bias
│       │   ├── articles_feats.pth
│       │   └── queries_feats.pth
│       ├── elmo_features_only_text_in_tweets
│       │   ├── articles_feats.pth
│       │   └── queries_feats.pth
│       ├── elmo_features_use_text_in_img
│       │   ├── articles_feats.pth
│       │   └── queries_feats.pth
│       ├── queries_content.json
│       ├── query.negatives
│       ├── query_article_interaction.csv
│       └── query_mapped.json
├── images_data
│   ├── full_Snopes_extracted_features.pth
│   ├── full_images_otweet_DataC_extracted_features.pth
│   ├── resnet50_Politifact_documents_extracted_features.pth
│   └── resnet50_Polititact_queries_extracted_features.pth
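
To confirm that the downloaded archives were extracted into this layout, a short script like the one below can be used. It is a convenience sketch rather than part of the repository: the file list is copied from the tree above, and some files (e.g. the elmo_features_* and 50_candidates_* folders) only appear after the scenario-specific archives in Step 3 have been extracted.

# Sanity-check the expected data layout (convenience sketch, not part of the repository).
import os

SPLITS = ["train", "dev", "test", "test2_hard"]
CANDIDATE_DIRS = [
    "50_candidates_bm25_extended_reranking",
    "50_candidates_bm25_extended_reranking_and_text_in_img",
    "50_candidates_bm25_extended_reranking_and_text_in_img_avoid_bias",
]
ELMO_DIRS = [
    "elmo_features_avoid_bias",
    "elmo_features_only_text_in_tweets",
    "elmo_features_use_text_in_img",
]
ROOT_FILES = ["article_mapped.json", "articles_content.json", "queries_content.json",
              "query.negatives", "query_article_interaction.csv", "query_mapped.json"]

def expected_files(dataset):
    root = os.path.join("formatted_data", dataset)
    files = []
    for d in CANDIDATE_DIRS:
        files += [os.path.join(root, d, "%s.%s.tsv" % (dataset, s)) for s in SPLITS]
    for d in ELMO_DIRS:
        files += [os.path.join(root, d, f) for f in ("articles_feats.pth", "queries_feats.pth")]
    files += [os.path.join(root, f) for f in ROOT_FILES]
    return files

missing = [p for ds in ("Politifact", "Snopes") for p in expected_files(ds)
           if not os.path.exists(p)]
print("All expected files found." if not missing else "Missing:\n" + "\n".join(missing))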

Usage

1. Install required packages

We use PyTorch 0.4.1 and Python 3.5.

pip install -r requirements.txt
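
Because the code targets an older PyTorch release, it is worth confirming the environment before training. The following quick check is our suggestion, not part of the repository:

# Quick environment check (convenience sketch, not part of the repository).
import sys
import torch

print("Python:", sys.version.split()[0])              # expected: 3.5.x
print("PyTorch:", torch.__version__)                   # expected: 0.4.1
print("CUDA available:", torch.cuda.is_available())    # the commands below pass --cuda=1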

2. Download and extract images data

pip install gdown
cd EMNLP2020
gdown https://drive.google.com/uc?id=17clyyiWyMDMUl6KqrDGGZCi2ZUeNSimh
unzip images_data.zip
rm images_data.zip

If you want to inspect the raw images, you can download them as follows:

gdown https://drive.google.com/u/0/uc?id=11sxoTJx49TBOde_xFY-fgWcG-aHNFhAp
unzip raw_images.zip
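
To verify that the extracted feature files are readable, you can load them with torch.load and look at what they contain. The snippet below is only a sketch; we assume the .pth files are ordinary torch-serialized objects (e.g. dicts of tensors), which may not match the actual format exactly.

# Peek at the extracted visual features (convenience sketch; assumes standard
# torch serialization, which may not hold for every file).
import torch

for path in ("images_data/full_Snopes_extracted_features.pth",
             "images_data/full_images_otweet_DataC_extracted_features.pth"):
    obj = torch.load(path, map_location="cpu")
    if isinstance(obj, dict):
        print(path, "-> dict with", len(obj), "entries")
    elif torch.is_tensor(obj):
        print(path, "-> tensor with shape", tuple(obj.shape))
    else:
        print(path, "->", type(obj))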

3.1 Running SC1 (Table 2 in our paper)

For Snopes

gdown https://drive.google.com/uc?id=1S_WWvU1Q1bKElJ04E3MI7z_bLzPIPw5C
unzip SC1_snopes.zip -d formatted_data/Snopes
mkdir logs
python Masters/master_man.py --attention_type=4 \
                             --conv_layers=2 \
                             --cuda=1 \
                             --use_elmo=1 --use_visual=1 \
                             --filters=256 \
                             --filters_count_pacrr=16 \
                             --fixed_length_left=50 \
                             --fixed_length_right=1000 \
                             --log="logs/man" \
                             --loss_type="hinge" \
                             --max_ngram=1 \
                             --n_s=48 \
                             --path="formatted_data/Snopes/50_candidates_bm25_extended_reranking" \
                             --query_mapped="formatted_data/Snopes/query_mapped.json" \
                             --article_mapped="formatted_data/Snopes/article_mapped.json" \
                             --left_images_features="images_data/full_images_otweet_DataC_extracted_features.pth" \
                             --right_images_features="images_data/full_Snopes_extracted_features.pth" \
                             --elmo_feats="formatted_data/Snopes/elmo_features_only_text_in_tweets"

For PolitiFact

gdown https://drive.google.com/uc?id=1zeqlv3JeBn-ygn0juTO4SWBucZXIMKZi
unzip SC1_politifact.zip -d formatted_data/Politifact
python Masters/master_man.py --attention_type=4 \
                             --conv_layers=2 \
                             --cuda=1 \
                             --use_elmo=1 --use_visual=1 \
                             --filters=256 \
                             --filters_count_pacrr=16 \
                             --fixed_length_left=50 \
                             --fixed_length_right=1000 \
                             --log="logs/man" \
                             --loss_type="hinge" \
                             --max_ngram=1 \
                             --n_s=48 \
                             --path="formatted_data/Politifact/50_candidates_bm25_extended_reranking" \
                             --query_mapped="formatted_data/Politifact/query_mapped.json" \
                             --article_mapped="formatted_data/Politifact/article_mapped.json" \
                             --left_images_features="images_data/resnet50_Polititact_queries_extracted_features.pth" \
                             --right_images_features="images_data/resnet50_Politifact_documents_extracted_features.pth" \
                             --elmo_feats="formatted_data/Politifact/elmo_features_only_text_in_tweets"
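
If you prefer to launch both SC1 runs from one place, a small Python wrapper such as the one below works; it simply replays the two commands above (the shared hyperparameters and per-dataset paths are copied verbatim), so treat it as a convenience sketch and adjust it if the command-line interface changes.

# Run SC1 for both datasets back to back (convenience sketch, not part of the repository).
import subprocess

COMMON = ["--attention_type=4", "--conv_layers=2", "--cuda=1",
          "--use_elmo=1", "--use_visual=1", "--filters=256",
          "--filters_count_pacrr=16", "--fixed_length_left=50",
          "--fixed_length_right=1000", "--log=logs/man",
          "--loss_type=hinge", "--max_ngram=1", "--n_s=48"]

FEATURES = {
    "Snopes": ("images_data/full_images_otweet_DataC_extracted_features.pth",
               "images_data/full_Snopes_extracted_features.pth"),
    "Politifact": ("images_data/resnet50_Polititact_queries_extracted_features.pth",
                   "images_data/resnet50_Politifact_documents_extracted_features.pth"),
}

for name, (left, right) in FEATURES.items():
    cmd = ["python", "Masters/master_man.py"] + COMMON + [
        "--path=formatted_data/%s/50_candidates_bm25_extended_reranking" % name,
        "--query_mapped=formatted_data/%s/query_mapped.json" % name,
        "--article_mapped=formatted_data/%s/article_mapped.json" % name,
        "--left_images_features=" + left,
        "--right_images_features=" + right,
        "--elmo_feats=formatted_data/%s/elmo_features_only_text_in_tweets" % name,
    ]
    subprocess.run(cmd, check=True)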

3.2 Running SC2 (MAN in Table 3 in our paper)

For the Snopes dataset

gdown https://drive.google.com/uc?id=1VDtJk_C-pZtBQXon2jvp4NTyxUnDv-gY
unzip SC2_snopes.zip -d formatted_data/Snopes
python Masters/master_man.py --attention_type=2 \
                             --conv_layers=2 \
                             --cuda=1 \
                             --use_elmo=1 --use_visual=1 \
                             --filters=256 \
                             --filters_count_pacrr=16 \
                             --fixed_length_left=100 \
                             --fixed_length_right=1000 \
                             --log="logs/man" \
                             --loss_type="hinge" \
                             --max_ngram=1 \
                             --n_s=32 \
                             --path="formatted_data/Snopes/50_candidates_bm25_extended_reranking_and_text_in_img" \
                             --query_mapped="formatted_data/Snopes/query_mapped.json" \
                             --article_mapped="formatted_data/Snopes/article_mapped.json" \
                             --left_images_features="images_data/full_images_otweet_DataC_extracted_features.pth" \
                             --right_images_features="images_data/full_Snopes_extracted_features.pth" \
                             --elmo_feats="formatted_data/Snopes/elmo_features_use_text_in_img"

For the PolitiFact dataset

gdown https://drive.google.com/uc?id=1UDPJdnawYZiicx02shywYGQ3c091Q8xW
unzip SC2_politifact.zip -d formatted_data/Politifact
python Masters/master_man.py --attention_type=2 \
                             --conv_layers=3 \
                             --cuda=1 \
                             --use_elmo=1 --use_visual=1 \
                             --filters=256 \
                             --filters_count_pacrr=16 \
                             --fixed_length_left=100 \
                             --fixed_length_right=1000 \
                             --log="logs/man" \
                             --loss_type="hinge" \
                             --max_ngram=1 \
                             --n_s=32 \
                             --path="formatted_data/Politifact/50_candidates_bm25_extended_reranking_and_text_in_img" \
                             --query_mapped="formatted_data/Politifact/query_mapped.json" \
                             --article_mapped="formatted_data/Politifact/article_mapped.json" \
                             --left_images_features="images_data/resnet50_Polititact_queries_extracted_features.pth" \
                             --right_images_features="images_data/resnet50_Politifact_documents_extracted_features.pth" \
                             --elmo_feats="formatted_data/Politifact/elmo_features_use_text_in_img"

3.3 Running SC2 with augmented data (MAN-A in Table 3 in our paper)

These experiments are memory-intensive, so we recommend running them on a server with at least 64 GB of RAM.
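
To check beforehand whether the machine has enough free memory, something like the snippet below can help (it needs the psutil package, which is not listed in requirements.txt, so install it separately if you want to use it).

# Rough check of available RAM before starting the MAN-A runs
# (convenience sketch; requires "pip install psutil").
import psutil

avail_gb = psutil.virtual_memory().available / 1024 ** 3
print("Available RAM: %.1f GB" % avail_gb)
if avail_gb < 64:
    print("Warning: the augmented-data experiments may not fit in memory.")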

For the Snopes dataset

gdown https://drive.google.com/u/0/uc?id=1GDONqAZ5lllmF-_XMgk4gVnJNyLP079v
unzip augment_snopes.zip -d formatted_data/Snopes
python Masters/master_man.py --attention_type=2 \
                             --conv_layers=2 \
                             --cuda=1 \
                             --use_elmo=1 --use_visual=1 \
                             --filters=256 \
                             --filters_count_pacrr=16 \
                             --fixed_length_left=100 \
                             --fixed_length_right=1000 \
                             --log="logs/man" \
                             --loss_type="hinge" \
                             --max_ngram=2 \
                             --n_s=32 \
                             --path="formatted_data/Snopes/50_candidates_bm25_extended_reranking_and_text_in_img_avoid_bias" \
                             --query_mapped="formatted_data/Snopes/query_mapped.json" \
                             --article_mapped="formatted_data/Snopes/article_mapped.json" \
                             --left_images_features="images_data/full_images_otweet_DataC_extracted_features.pth" \
                             --right_images_features="images_data/full_Snopes_extracted_features.pth" \
                             --elmo_feats="formatted_data/Snopes/elmo_features_avoid_bias"

For the PolitiFact dataset

gdown https://drive.google.com/u/0/uc?id=10e1JhhbfQWYILkovaeopGuhD1VQ_ZPYc
unzip augment_politifact.zip -d formatted_data/Politifact
python Masters/master_man.py --attention_type=4 \
                             --conv_layers=2 \
                             --cuda=1 \
                             --use_elmo=1 --use_visual=1 \
                             --filters=256 \
                             --filters_count_pacrr=16 \
                             --fixed_length_left=100 \
                             --fixed_length_right=1000 \
                             --log="logs/man" \
                             --loss_type="hinge" \
                             --max_ngram=3 \
                             --n_s=48 \
                             --path="formatted_data/Politifact/50_candidates_bm25_extended_reranking_and_text_in_img_avoid_bias" \
                             --query_mapped="formatted_data/Politifact/query_mapped.json" \
                             --article_mapped="formatted_data/Politifact/article_mapped.json" \
                             --left_images_features="images_data/resnet50_Polititact_queries_extracted_features.pth" \
                             --right_images_features="images_data/resnet50_Politifact_documents_extracted_features.pth" \
                             --elmo_feats="formatted_data/Politifact/elmo_features_avoid_bias"

Citation

If you find our paper and resources useful, please consider citing our work as follows:

@inproceedings{vo2020facts,
	title={Where Are the Facts? Searching for Fact-checked Information to Alleviate the Spread of Fake News},
	author={Vo, Nguyen and Lee, Kyumin},
	booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)},
	year={2020}
}

Slides of our talk at EMNLP 2020

https://slideslive.com/38938793/where-are-the-facts-searching-for-factchecked-information-to-alleviate-the-spread-of-fake-news
