
DeFacto / Defactonlp

DeFactoNLP: An Automated Fact-checking System that uses Named Entity Recognition, TF-IDF vector comparison and Decomposable Attention models.

Programming Languages

python

Projects that are alternatives of or similar to Defactonlp

Ner Bert
BERT-NER (nert-bert) with google bert https://github.com/google-research.
Stars: ✭ 339 (+1030%)
Mutual labels:  attention, ner
Nlp Journey
Documents, papers and code related to Natural Language Processing, including Topic Model, Word Embedding, Named Entity Recognition, Text Classification, Text Generation, Text Similarity, Machine Translation, etc. All code is implemented in TensorFlow 2.0.
Stars: ✭ 1,290 (+4200%)
Mutual labels:  attention, ner
Natasha
Solves basic Russian NLP tasks, API for lower level Natasha projects
Stars: ✭ 788 (+2526.67%)
Mutual labels:  ner
Nlp Knowledge Graph
Research and applications of three major technologies: natural language processing, knowledge graphs, and dialogue systems.
Stars: ✭ 908 (+2926.67%)
Mutual labels:  ner
Cell Detr
Official and maintained implementation of the paper Attention-Based Transformers for Instance Segmentation of Cells in Microstructures [BIBM 2020].
Stars: ✭ 26 (-13.33%)
Mutual labels:  attention
Spatial Transformer Network
A Tensorflow implementation of Spatial Transformer Networks.
Stars: ✭ 794 (+2546.67%)
Mutual labels:  attention
Nlp tensorflow project
Uses TensorFlow to implement several NLP projects, e.g. classification, chatbot, NER, attention, QA, etc.
Stars: ✭ 27 (-10%)
Mutual labels:  attention
Bert Chinese Ner
Chinese NER using the pre-trained language model BERT.
Stars: ✭ 758 (+2426.67%)
Mutual labels:  ner
Recognizers Text
Microsoft.Recognizers.Text provides recognition and resolution of numbers, units, and date/time expressed in multiple languages (ZH, EN, FR, ES, PT, DE, IT, TR, HI. Partial support for NL, JA, KO, SV). Contributions are greatly welcome! Packages are available at https://www.nuget.org/profiles/Recognizers.Text and https://www.npmjs.com/~recognizers.text
Stars: ✭ 915 (+2950%)
Mutual labels:  ner
Coursera Uw Machine Learning Clustering Retrieval
Stars: ✭ 25 (-16.67%)
Mutual labels:  tf-idf
Tf ner
Simple and Efficient Tensorflow implementations of NER models with tf.estimator and tf.data
Stars: ✭ 876 (+2820%)
Mutual labels:  ner
Pytorch Gat
My implementation of the original GAT paper (Veličković et al.). I've additionally included the playground.py file for visualizing the Cora dataset, GAT embeddings, an attention mechanism, and entropy histograms. I've supported both Cora (transductive) and PPI (inductive) examples!
Stars: ✭ 908 (+2926.67%)
Mutual labels:  attention
Chatbot cn
A chatbot for the finance and legal domains (with chit-chat capability). Its main modules include information extraction, NLU, NLG, and knowledge graphs; the front end is integrated via Django, and RESTful interfaces for the NLP and KG modules are already provided.
Stars: ✭ 791 (+2536.67%)
Mutual labels:  ner
Knowledge Graphs
A collection of research on knowledge graphs
Stars: ✭ 845 (+2716.67%)
Mutual labels:  ner
Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+2533.33%)
Mutual labels:  tf-idf
Isab Pytorch
An implementation of (Induced) Set Attention Block, from the Set Transformers paper
Stars: ✭ 21 (-30%)
Mutual labels:  attention
Lm Lstm Crf
Empower Sequence Labeling with Task-Aware Language Model
Stars: ✭ 778 (+2493.33%)
Mutual labels:  ner
Sohu baseline
BERT-based Chinese named entity recognition (PyTorch).
Stars: ✭ 19 (-36.67%)
Mutual labels:  ner
Chinesener
Chinese named entity recognition and entity extraction; TensorFlow, PyTorch, BiLSTM+CRF.
Stars: ✭ 938 (+3026.67%)
Mutual labels:  ner
Meta Emb
Multilingual Meta-Embeddings for Named Entity Recognition (RepL4NLP & EMNLP 2019)
Stars: ✭ 28 (-6.67%)
Mutual labels:  ner

DeFactoNLP

DeFactoNLP is an automated fact-checking system designed for the FEVER 2018 Shared Task, held at EMNLP 2018. It verifies claims and retrieves the Wikipedia sentences that support its assessment. This is accomplished using named entity recognition, TF-IDF vector comparison, and decomposable attention models.

Achievements

  • 5th place in F1 Evidence Score
  • 12th place in the FEVER Score

Cite

If you use this code, please cite:

@inproceedings{Reddy2018,
  title={DeFactoNLP: Fact Verification using Entity Recognition, TFIDF Vector Comparison and Decomposable Attention},
  publisher={FEVER 2018, organised under EMNLP 2018},
  author={Reddy, Aniketh Janardhan and Rocha, Gil and Esteves, Diego},
  year={2018}
}

System Structure

The system is built around three major tasks (Document Retrieval, Sentence Retrieval, Label Classification), each performed using different techniques:

  • Document Retrieval
    • TF-IDF
    • NER
    • Triple-Based
  • Sentence Retrieval
  • Label Classification
    • RTE Model + Random Forest model

Run

Document Retrieval and Sentence Retrieval can both be run with the script generate_rte_preds.py.

The script contains six boolean variables:

  • INCLUDE_NER --> if the input file contains NER-predicted DOCUMENTS and you want to include them as relevant documents
  • INCLUDE_TRIPLE_BASED --> if the input file contains Triple-Based-predicted DOCUMENTS and you want to include them as relevant documents
  • INCLUDE_SENTENCE_BERT --> if the input file contains Sentence-Transformers-predicted SENTENCES and you want to include them as relevant sentences
  • RUN_DOC_TRIPLE_BASED --> to predict Triple-Based relevant DOCUMENTS
  • RUN_SENT_TRIPLE_BASED --> to predict Triple-Based relevant SENTENCES
  • RUN_RTE --> to run Recognising Textual Entailment and calculate the probabilities for every relevant sentence

Changing these variables lets you run each step separately or all at once, and even include other retrieval techniques by supplying files with that information.
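
For reference, a minimal sketch of how these flags might be toggled at the top of generate_rte_preds.py; the exact layout and default values in the script may differ:

# Illustrative configuration block; the actual script may differ.
# Document Retrieval inputs:
INCLUDE_NER = True             # include NER-predicted documents as relevant
INCLUDE_TRIPLE_BASED = True    # include Triple-Based-predicted documents as relevant
# Sentence Retrieval inputs:
INCLUDE_SENTENCE_BERT = True   # include Sentence-Transformers-predicted sentences
# Steps to execute in this run:
RUN_DOC_TRIPLE_BASED = False   # predict Triple-Based relevant documents
RUN_SENT_TRIPLE_BASED = False  # predict Triple-Based relevant sentences
RUN_RTE = True                 # compute entailment probabilities for relevant sentences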

To generate the final predictions, run the Label Classification step (described below).

You can run all metrics using the script metrics.py.

In-Depth Information

Always check file paths before running anything.

Data

All the files with the claim information are in the data folder. The files train.jsonl, dev.jsonl and test.jsonl were extracted from the FEVER database.

The Wikipedia corpus can be downloaded by running the script download-raw-wiki.sh. To speed up the algorithms, we split every article into individual files using the script split_wiki_into_indv_docs.py.
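
As a rough illustration, the splitting step amounts to something like the following; this is a sketch, assuming the standard FEVER wiki-pages jsonl format with an "id" field per article, and all file names and paths are hypothetical:

import json
import os

# Sketch: split a FEVER wiki-pages jsonl dump into one file per article.
# Assumes each line is a JSON object with an "id" field; paths are hypothetical.
os.makedirs("data/wiki-individual", exist_ok=True)
with open("data/wiki-pages/wiki-001.jsonl", encoding="utf-8") as dump:
    for line in dump:
        article = json.loads(line)
        out_path = os.path.join("data/wiki-individual", article["id"] + ".json")
        with open(out_path, "w", encoding="utf-8") as out:
            json.dump(article, out)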

We also created a training subsample using the script subsample_training_data.py.

The files subsample_train_relevant_docs.jsonl, shared_task_dev_public_relevant_docs.jsonl and shared_task_test_relevant_docs.jsonl contain the output of the TF-IDF part of Document Retrieval (predicted_pages) and of Sentence Retrieval (predicted_sentences).
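
For orientation, a record in these files can be read like this; a minimal sketch in which only the predicted_pages and predicted_sentences fields come from the description above, and the "claim" field is an assumption based on the FEVER format:

import json

# Sketch: read one claim record from a relevant-docs file.
with open("data/shared_task_dev_public_relevant_docs.jsonl", encoding="utf-8") as f:
    record = json.loads(f.readline())
    print(record["claim"])                # the claim text (assumed field name)
    print(record["predicted_pages"])      # TF-IDF Document Retrieval output
    print(record["predicted_sentences"])  # TF-IDF Sentence Retrieval output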

The files carry certain keywords: OIE stands for Open Information Extraction (used in Document Retrieval), while SENTENCE means a Triple-Based method was used for Sentence Selection. Always check the first line of every file to know which retrieval method produced it.

TF-IDF (Document and Sentence Retrieval)

The TF-IDF results can be reproduced by running the scripts inside the fever-baselines folder: first download the database, then run the TF-IDF part. The output files are already generated and can be found in the data folder.
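
Independently of fever-baselines, the underlying idea can be illustrated with scikit-learn; a minimal sketch, not the fever-baselines implementation, with placeholder documents:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sketch: rank documents against a claim by TF-IDF cosine similarity.
documents = ["Rome is the capital of Italy.", "Paris is the capital of France."]
claim = "Paris is the capital city of France."

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
claim_vector = vectorizer.transform([claim])

scores = cosine_similarity(claim_vector, doc_vectors)[0]
ranked = sorted(zip(scores, documents), reverse=True)
print(ranked[0])  # the most similar document and its score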

Label Classification

Label Classification is performed by training a Random Forest model. The train_label_classifier.py script trains the model and predicts the claim labels based on the probabilities produced by the RTE model. The folder entailment_predictions_train contains already-calculated probabilities for our subsample_train.jsonl. A file ready to be submitted is generated.
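
Conceptually, the classifier consumes RTE probabilities per claim and outputs one of the three FEVER labels. A minimal sketch with scikit-learn; the feature layout and training values are assumptions for illustration, not necessarily the exact features used by train_label_classifier.py:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Sketch: each row aggregates RTE probabilities over a claim's retrieved
# sentences, e.g. [max support prob, max refute prob, mean NEI prob].
# The feature layout and values are assumptions.
X_train = np.array([[0.9, 0.05, 0.2], [0.1, 0.8, 0.3], [0.2, 0.1, 0.9]])
y_train = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.predict([[0.85, 0.1, 0.25]]))  # -> ['SUPPORTS']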

Metrics

Running metrics.py gives you detailed statistics about each of the three tasks.
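
For intuition, the headline FEVER score counts a claim as correct only if its label is right and, for verifiable claims, at least one complete gold evidence set was retrieved. A simplified sketch of that definition, not the official scorer and not metrics.py itself:

# Simplified FEVER-score sketch (not the official scorer).
def fever_score(instances):
    correct = 0
    for inst in instances:
        label_ok = inst["predicted_label"] == inst["label"]
        if inst["label"] == "NOT ENOUGH INFO":
            evidence_ok = True  # no evidence required for NEI claims
        else:
            retrieved = set(inst["predicted_evidence"])
            # at least one complete gold evidence set must be covered
            evidence_ok = any(set(gold) <= retrieved for gold in inst["evidence"])
        correct += label_ok and evidence_ok
    return correct / len(instances)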

Triple Based Model (Sentence Retrieval)

Running proof_extraction_train.py trains the model. Pass argument 0 to create the dataset (relevant and non-relevant sentences), 1 to extract all features from the sentences, and 2 to train the model. Ideally, pass all three numbers as arguments, as in the sketch below.
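
Schematically, the stage numbers are read from the command line; a sketch of the dispatch, not the script's exact code:

import sys

# Sketch of the stage dispatch; run e.g. `python proof_extraction_train.py 0 1 2`.
STAGES = {
    "0": "create the dataset (relevant and non-relevant sentences)",
    "1": "extract features from the sentences",
    "2": "train the model",
}
for arg in sys.argv[1:]:
    print("Running stage %s: %s" % (arg, STAGES[arg]))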

Word2vec

For Document Retrieval, we tried using a Word2vec model to retrieve the documents closest to a given claim. Although promising in principle, it did not yield good results, which we attribute to insufficient training. There are two comparisons:

  1. between the title of the document and the sentence
  2. between the title of the document and every word of the claim (without stopwords)

The main issue is how slow the process is; indexing the titles would improve the processing speed. Comparison 2 is more promising, although even slower. You can find the code in word2vec.py.
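
A minimal sketch of comparison 2 with gensim, assuming pre-trained word vectors; the vector file path and example words are placeholders:

from gensim.models import KeyedVectors

# Sketch: score document titles against the claim's words (comparison 2).
# The vector file path is hypothetical; any word2vec-format vectors work,
# and all words must be in the vocabulary.
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

claim_words = ["Paris", "capital", "France"]  # claim with stopwords removed
titles = [["Paris"], ["France"], ["Rome"]]

# n_similarity compares two bags of words via their mean vectors.
scores = [vectors.n_similarity(title, claim_words) for title in titles]
print(sorted(zip(scores, titles), reverse=True)[0])  # best-matching title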

Doc2vec

For Document Retrieval, we also tried a Doc2vec model to retrieve the documents closest to a given claim. The code is in doc2vec.py; it did not give any promising results, since the vectors generated for the claims and the documents are very different.
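
Schematically, the idea behind doc2vec.py looks like this in gensim; a sketch with placeholder corpus and parameters:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Sketch: embed documents with Doc2Vec, then embed the claim and compare.
corpus = [
    TaggedDocument(["paris", "is", "the", "capital", "of", "france"], ["Paris"]),
    TaggedDocument(["rome", "is", "the", "capital", "of", "italy"], ["Rome"]),
]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

claim_vector = model.infer_vector(["paris", "capital", "france"])
# most_similar over the document vectors returns the nearest document tags
print(model.dv.most_similar([claim_vector], topn=1))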

NER and Triple Based (Document Retrieval)

The file doc_retrieval.py contains the code used to find the most relevant documents using NER as well as our Triple-Based approach.
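
The NER part can be illustrated with spaCy: extract the claim's named entities and treat Wikipedia articles whose titles match them as candidate documents. A sketch, not the exact doc_retrieval.py code; the tool and model choice are assumptions:

import spacy

# Sketch: use NER to propose candidate documents by title match.
nlp = spacy.load("en_core_web_sm")  # model choice is an assumption

claim = "Barack Obama was born in Hawaii."
entities = [ent.text for ent in nlp(claim).ents]  # e.g. ['Barack Obama', 'Hawaii']

wiki_titles = {"Barack_Obama", "Hawaii", "Paris"}  # placeholder title index
candidates = [e.replace(" ", "_") for e in entities
              if e.replace(" ", "_") in wiki_titles]
print(candidates)  # documents to treat as relevant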

Sentence-Transformers

The code to run Sentence-Transformers is in run_sentence_selection.py, which chooses the top 5 most similar sentences; run_sentence_selection_doc.py instead chooses the top 2 sentences for every retrieved document. Our fine-tuning script is train_sentence_model.py.
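
A minimal sketch of the top-k selection with the sentence-transformers library; the model name, k, and example sentences are placeholders:

from sentence_transformers import SentenceTransformer, util

# Sketch: rank candidate sentences against the claim by cosine similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")  # model name is a placeholder

claim = "Paris is the capital of France."
sentences = [
    "France's capital city is Paris.",
    "Rome is the capital of Italy.",
    "The Eiffel Tower is in Paris.",
]
claim_emb = model.encode(claim, convert_to_tensor=True)
sent_embs = model.encode(sentences, convert_to_tensor=True)

hits = util.semantic_search(claim_emb, sent_embs, top_k=2)[0]
for hit in hits:
    print(sentences[hit["corpus_id"]], hit["score"])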
