
IBM / science-result-extractor

License: Apache-2.0
No description or website provided.

Programming Languages

Java
68154 projects - #9 most used programming language
Python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to science-result-extractor

bridging-resolution
No description or website provided.
Stars: ✭ 12 (-79.66%)
Mutual labels:  ibm-research, ibm-research-ai
evildork
Evildork targeting your fianceeπŸ‘οΈ
Stars: ✭ 46 (-22.03%)
Mutual labels:  information-extraction
InformationExtractionSystem
Information Extraction System can perform NLP tasks like Named Entity Recognition, Sentence Simplification, Relation Extraction etc.
Stars: ✭ 27 (-54.24%)
Mutual labels:  information-extraction
cord19q
COVID-19 Open Research Dataset (CORD-19) Analysis
Stars: ✭ 54 (-8.47%)
Mutual labels:  scientific-papers
slotminer
Tool for slot extraction from text
Stars: ✭ 15 (-74.58%)
Mutual labels:  information-extraction
EAD-Attack
Codes for reproducing the white-box adversarial attacks in β€œEAD: Elastic-Net Attacks to Deep Neural Networks via Adversarial Examples,” AAAI 2018
Stars: ✭ 22 (-62.71%)
Mutual labels:  ibm-research-ai
iww
AI based web-wrapper for web-content-extraction
Stars: ✭ 61 (+3.39%)
Mutual labels:  information-extraction
CoVA-Web-Object-Detection
A Context-aware Visual Attention-based training pipeline for Object Detection from a Webpage screenshot!
Stars: ✭ 18 (-69.49%)
Mutual labels:  information-extraction
wen-notes
My notes.
Stars: ✭ 71 (+20.34%)
Mutual labels:  information-extraction
simple NER
simple rule based named entity recognition
Stars: ✭ 29 (-50.85%)
Mutual labels:  information-extraction
minie
An open information extraction system that provides compact extractions
Stars: ✭ 83 (+40.68%)
Mutual labels:  information-extraction
neji
Flexible and powerful platform for biomedical information extraction from text
Stars: ✭ 37 (-37.29%)
Mutual labels:  information-extraction
Hyper-Table-OCR
A carefully-designed OCR pipeline for universal boarded table recognition and reconstruction.
Stars: ✭ 96 (+62.71%)
Mutual labels:  table-extraction
Deep-NLP-Resources
Curated list of all NLP Resources
Stars: ✭ 65 (+10.17%)
Mutual labels:  information-extraction
PLE
Label Noise Reduction in Entity Typing (KDD'16)
Stars: ✭ 53 (-10.17%)
Mutual labels:  information-extraction
odinson
Odinson is a powerful and highly optimized open-source framework for rule-based information extraction. Odinson couples a simple, yet powerful pattern language that can operate over multiple representations of text, with a runtime system that operates in near real time.
Stars: ✭ 59 (+0%)
Mutual labels:  information-extraction
trinity-ie
Information extraction pipeline containing coreference resolution, named entity linking, and relationship extraction
Stars: ✭ 59 (+0%)
Mutual labels:  information-extraction
nested-ner-tacl2020-flair
Implementation of Nested Named Entity Recognition using Flair
Stars: ✭ 23 (-61.02%)
Mutual labels:  information-extraction
LNEx
πŸ“ 🏒 🏦 🏣 πŸͺ 🏬 LNEx: Location Name Extractor
Stars: ✭ 21 (-64.41%)
Mutual labels:  information-extraction
IE Paper Notes
Paper notes for Information Extraction, including Relation Extraction (RE), Named Entity Recognition (NER), Entity Linking (EL), Event Extraction (EE), Named Entity Disambiguation (NED).
Stars: ✭ 14 (-76.27%)
Mutual labels:  information-extraction

Science-result-extractor

Introduction

This repository contains code and a few datasets to extract TDMS (Task, Dataset, Metric, Score) tuples from scientific papers in the NLP domain. We envision three primary uses for this repository: (1) to extract table content from PDF files, (2) to replicate the paper's results or run experiments based on a textual entailment system, and (3) to train a model to extract TDM mentions. Please refer to the following paper for the full details:

Yufang Hou, Charles Jochim, Martin Gleize, Francesca Bonin, Debasis Ganguly. Identification of Tasks, Datasets, Evaluation Metrics, and Numeric Scores for Scientific Leaderboards Construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy, 27 July - 2 August 2019

Yufang Hou, Charles Jochim, Martin Gleize, Francesca Bonin, Debasis Ganguly. TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics. In Proceedings of the 16th conference of the European Chapter of the Association for Computational Linguistics (EACL 2021), Online, 19-23 April 2021

Extract table content from PDF files

We developed a deterministic PDF table parser based on GROBID. To use our parser, follow the steps below:

  1. Fork and clone this repository, e.g.,
> git clone https://github.com/IBM/science-result-extractor.git
  2. Download and install GROBID 0.5.3, following the installation instructions, e.g.,
> wget https://github.com/kermitt2/grobid/archive/0.5.3.zip
> unzip 0.5.3.zip
> cd grobid-0.5.3/
> ./gradlew clean install

(note that a JDK must be installed beforehand; the gradlew wrapper script ships with GROBID)

  3. Configure pGrobidHome and pGrobidProperties in config.properties (a sanity-check sketch follows this list). The default configuration assumes that the GROBID directory grobid-0.5.3 is a sibling of the science-result-extractor directory:
pGrobidHome=../../grobid-0.5.3/grobid-home
pGrobidProperties=../../grobid-0.5.3/grobid-home/config/grobid.properties
  4. PdfInforExtractor provides methods to extract section content and table content from a given PDF file.
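Before parsing PDFs, it can help to sanity-check the two paths above. The following is a minimal Python sketch (not part of the repository) that reads config.properties, using the key names from the snippet above, and reports whether each GROBID path exists:

from pathlib import Path

def read_properties(path):
    # Parse a minimal Java-style .properties file into a dict.
    props = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

props = read_properties("config.properties")
for key in ("pGrobidHome", "pGrobidProperties"):
    path = Path(props.get(key, ""))
    print(key, "->", path, "[OK]" if path.exists() else "[MISSING]")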

Run experiments based on the textual entailment system

We release the training/testing datasets for all experiments described in the paper; you can find them under the data/exp directory. The results reported in the paper are based on the datasets under the data/exp/few-shot-setup/NLP-TDMS/paperVersion directory. We later cleaned these datasets further (e.g., removed five PDF files from the test set that also appear in the training set under a different name); the clean version is under the data/exp/few-shot-setup/NLP-TDMS folder. Below we illustrate how to run experiments on the NLP-TDMS dataset in the few-shot setup to extract TDM pairs.

  1. Fork and clone this repository.

  2. Download or clone BERT.

  3. Copy run_classifier_sci.py into the BERT directory.

  4. Download BERT embeddings. We use the base uncased models.

  5. If we use BERT_DIR to point to the directory with the embeddings and DATA_DIR to point to the directory with our train and test data, we can run the textual entailment system with run_classifier_sci.py (a sketch for reading the resulting predictions follows this list). For example:

> DATA_DIR=../data/exp/few-shot-setup/NLP-TDMS/
> BERT_DIR=./model/uncased_L-12_H-768_A-12/
> python3 run_classifier_sci.py --do_train=true --do_eval=false --do_predict=true --data_dir=${DATA_DIR} --task_name=sci --vocab_file=${BERT_DIR}/vocab.txt --bert_config_file=${BERT_DIR}/bert_config.json --init_checkpoint=${BERT_DIR}/bert_model.ckpt --output_dir=bert_tdms --max_seq_length=512 --train_batch_size=6 --predict_batch_size=6
  6. TEModelEvalOnNLPTDMS provides methods to evaluate TDMS tuple extraction.

  7. GenerateTestDataOnPDFPapers provides methods to generate a test dataset from any PDF papers.
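With --do_predict=true, the stock BERT run_classifier.py writes a test_results.tsv file to the output directory (bert_tdms above), one row of tab-separated class probabilities per test example. Assuming run_classifier_sci.py keeps that behavior, the following sketch turns those rows into predicted labels; the label order is an assumption, so check the task processor in run_classifier_sci.py:

from pathlib import Path

LABELS = ["true", "false"]  # assumed label order; verify against run_classifier_sci.py

predictions = []
for row in Path("bert_tdms/test_results.tsv").read_text().splitlines():
    probs = [float(p) for p in row.split("\t")]
    predictions.append(LABELS[probs.index(max(probs))])  # argmax over class probabilities

print(predictions[:10])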

Read NLP-TDMS and ARC-PDN corpora

  1. Follow the instructions in the README in data/NLP-TDMS/downloader/ to download the entire collection of raw PDFs of the NLP-TDMS dataset. The downloaded PDFs can be moved to data/NLP-TDMS/pdfFile (i.e., mv *.pdf ../pdfFile/.).

  2. For the ARC-PDN corpus, the original PDF files can be downloaded from the ACL Anthology Reference Corpus (Version 20160301). We use papers from ACL (P), EMNLP (D), and NAACL (N) between 2010 and 2015. After uncompressing the downloaded files, put the PDFs into the corresponding directories under the /data/ARC-PDN/ folder, e.g., copy D10 to /data/ARC-PDN/D/D10 (see the sketch after this list).

  3. We release the parsed NLP-TDMS and ARC-PDN corpora. NlpTDMSReader and ArcPDNReader in the corpus package illustrate how to read section and table contents from PDF files in these two corpora.
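The layout in step 2 can be scripted. The sketch below assumes the uncompressed PDFs sit in a flat downloads/ directory with standard anthology names such as D10-1001.pdf, and sorts them into data/ARC-PDN/<venue letter>/<volume>:

import shutil
from pathlib import Path

src = Path("downloads")        # assumed location of the uncompressed PDFs
dst_root = Path("data/ARC-PDN")

for pdf in src.glob("*.pdf"):
    # Anthology names start with a venue letter plus a two-digit year, e.g. D10-1001.pdf.
    venue, volume = pdf.name[0], pdf.name[:3]
    target = dst_root / venue / volume
    target.mkdir(parents=True, exist_ok=True)
    shutil.move(str(pdf), str(target / pdf.name))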

Train a model to extract TDM mentions

We release the TDMSci corpus (under the data folder). The dataset is in the standard CoNLL format.
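In CoNLL format, each non-blank line holds one token followed by its tag(s) in whitespace-separated columns, and a blank line separates sentences. Below is a minimal reader sketch; the file name is hypothetical and the tag is assumed to be the last column:

from pathlib import Path

def read_conll(path):
    # Yield one sentence at a time as a list of (token, tag) pairs.
    sentence = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split()
        sentence.append((cols[0], cols[-1]))
    if sentence:
        yield sentence

for sentence in read_conll("data/TDMSci/train.conll"):  # hypothetical file name
    print(sentence[:5])
    break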

Citing science-result-extractor

Please cite the following paper when using science-result-extractor:

@inproceedings{houyufang2019acl,
  title={Identification of Tasks, Datasets, Evaluation Metrics, and Numeric Scores for Scientific Leaderboards Construction},
  author={Hou, Yufang and Jochim, Charles and Gleize, Martin and Bonin, Francesca and Ganguly, Debasis},
  booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, {\em Florence, Italy, 27 July -- 2 August 2019}},
  year      = {2019}
}

@inproceedings{houyufang2021eacl,
  title={TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics},
  author={Hou, Yufang and Jochim, Charles and Gleize, Martin and Bonin, Francesca and Ganguly, Debasis},
  booktitle = {Proceedings of the 16th conference of the European Chapter of the Association for Computational Linguistics, {\em Online, 19--23 April 2021}},
  year      = {2021}
}