
Praneet9 / Representation-Learning-for-Information-Extraction

License: Apache-2.0
PyTorch implementation of the Google Research paper "Representation Learning for Information Extraction from Form-like Documents".

ReLIE: Representation-Learning-for-Information-Extraction

This is an unofficial PyTorch implementation of Representation Learning for Information Extraction (ReLIE) from Form-like Documents.

Model Architecture

[Figure: model architecture (image source)]

Getting Started

  1. Clone the repository
git clone https://github.com/Praneet9/Representation-Learning-for-Information-Extraction.git
  2. Create a virtualenv and install the required packages
pip install -r requirements.txt

Prepare dataset

STEP 1: Annotation

The dataset can be created with an image annotation tool such as labelImg (used in this project) or any other tool that saves annotations as XML files in Pascal VOC format. To mark the true candidate for a required field, draw a bounding box around the word you want to extract. For our experiment, we annotated the following fields:

  • Invoice Number
  • Invoice Date
  • Total Amount

Annotation Demo
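
As an illustration of what the annotation step produces, the sketch below reads the labelled bounding boxes out of a Pascal VOC XML string. It is not code from this repository: the function name `parse_voc_annotation`, the sample file name, and the label names `invoice_number` and `total_amount` are assumptions made for the example. Only the standard Pascal VOC layout (`object`, `name`, `bndbox`) is relied on.

```python
import xml.etree.ElementTree as ET


def parse_voc_annotation(xml_text):
    """Return a list of (label, (xmin, ymin, xmax, ymax)) from Pascal VOC XML."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bndbox = obj.find("bndbox")
        # Some tools write coordinates as floats, so go through float() first.
        coords = tuple(
            int(float(bndbox.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((label, coords))
    return boxes


# Hypothetical annotation for one invoice image (labels are assumptions).
SAMPLE_XML = """<annotation>
  <filename>invoice_001.jpg</filename>
  <object>
    <name>invoice_number</name>
    <bndbox><xmin>812</xmin><ymin>140</ymin><xmax>930</xmax><ymax>168</ymax></bndbox>
  </object>
  <object>
    <name>total_amount</name>
    <bndbox><xmin>1002</xmin><ymin>1510</ymin><xmax>1104</xmax><ymax>1540</ymax></bndbox>
  </object>
</annotation>"""

if __name__ == "__main__":
    for label, box in parse_voc_annotation(SAMPLE_XML):
        print(label, box)
```

Each annotated box can later be matched against the OCR word boxes to decide which extracted candidate is the true one for a field.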

STEP 2: Generate OCRs

Prerequisites:
We used Tesseract 4.0 to generate the OCR results.
You can install Tesseract from its official source here.
For best results, make sure you replace the default model with the LSTM model.
Download the LSTM models from here.
Once everything is set up, run the command below to generate the Tesseract results, which will be saved in the tesseract_results_lstm directory.

$ python generate_tesseract_results.py
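
The README does not say in which format the script stores its results. As a rough sketch only, the code below assumes the OCR output is Tesseract's TSV format (as produced by, for example, `tesseract image.jpg out --oem 1 -l eng tsv`) and turns it into a list of words with bounding boxes. The function name `parse_tesseract_tsv`, the `min_conf` threshold, and the sample rows are assumptions for the example, not part of this repository.

```python
import csv
import io


def parse_tesseract_tsv(tsv_text, min_conf=0.0):
    """Return word dicts (text, conf, box) from Tesseract TSV output."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])
        # Page, block, and line rows carry conf == -1 and no text; skip them.
        if not text or conf < min_conf:
            continue
        left, top = int(row["left"]), int(row["top"])
        width, height = int(row["width"]), int(row["height"])
        words.append(
            {"text": text, "conf": conf, "box": (left, top, left + width, top + height)}
        )
    return words


# Hypothetical TSV output for a single line of an invoice.
SAMPLE_TSV = "\n".join([
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext",
    "1\t1\t0\t0\t0\t0\t0\t0\t1240\t1754\t-1\t",
    "5\t1\t1\t1\t1\t1\t100\t140\t90\t28\t96\tInvoice",
    "5\t1\t1\t1\t1\t2\t812\t140\t118\t28\t91\t1041-553337",
])

if __name__ == "__main__":
    for word in parse_tesseract_tsv(SAMPLE_TSV):
        print(word)
```

Boxes are converted from (left, top, width, height) to corner coordinates so they can be compared directly with Pascal VOC annotations.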

STEP 3: Extract Candidates

Modify extract_candidates.py to suit your dataset and classes.

  • Invoice numbers: Use regular expressions to extract the candidates for the invoice number (e.g. 221233, 1041-553337)

  • Amounts: Use regular expressions to extract the candidates for the total amount (e.g. $222.32, $1200.44)

  • Dates: Use dateparser to extract the candidates for the invoice date

from dateparser.search import search_dates
search_dates(all_text)
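
The regular-expression generators for invoice numbers and amounts could look roughly like the sketch below. The patterns are assumptions written to match the examples given above (221233, 1041-553337, $222.32, $1200.44); they are not taken from extract_candidates.py and will need adjusting for other invoice layouts and currencies.

```python
import re

# Assumed pattern: a run of 4+ digits, optionally followed by "-" and 4+ digits.
INVOICE_NUMBER_RE = re.compile(r"\b\d{4,}(?:-\d{4,})?\b")

# Assumed pattern: dollar sign, digits with optional thousands separators,
# and an optional two-digit decimal part.
AMOUNT_RE = re.compile(r"\$\s?\d+(?:,\d{3})*(?:\.\d{2})?")


def find_candidates(pattern, text):
    """Return (matched_text, start, end) for every match of pattern in text."""
    return [(m.group(0), m.start(), m.end()) for m in pattern.finditer(text)]


if __name__ == "__main__":
    all_text = "Invoice No: 1041-553337 Ref 221233 Total: $222.32"
    print(find_candidates(INVOICE_NUMBER_RE, all_text))
    print(find_candidates(AMOUNT_RE, all_text))
```

Patterns like these deliberately over-generate: for instance, the digits in $1200.44 also match the invoice-number pattern as 1200. That is acceptable for candidate generation, since choosing the correct candidate is left to the trained model.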

STEP 4: Define dataset split and update config

Split the dataset into training and validation sets.

Specify the dataset directory and the split ratio in the utility script, then run:

python3 utils/prepare_split.py
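
For readers who want to see what such a split involves, here is a minimal sketch of a reproducible train/validation split over a list of file names. It is an illustration under assumptions, not the contents of utils/prepare_split.py: the function name `split_dataset`, the default ratio of 0.2, and the seed are all hypothetical.

```python
import random


def split_dataset(items, val_ratio=0.2, seed=42):
    """Shuffle items deterministically and return (train, validation) lists."""
    if not 0.0 < val_ratio < 1.0:
        raise ValueError("val_ratio must be between 0 and 1")
    # Sort first so the result does not depend on directory listing order.
    shuffled = sorted(items)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_ratio))
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    files = [f"invoice_{i:03d}.jpg" for i in range(10)]
    train, val = split_dataset(files, val_ratio=0.2)
    print(len(train), "training files,", len(val), "validation files")
```

Fixing the seed keeps the split stable between runs, so validation results remain comparable across training sessions.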

Before running the training or evaluation script, update the configuration to match your setup.

Train

python3 train.py

Evaluation

Coming Soon...

Inference

  • Get the inference results by running
python3 inference.py --image sample.jpg --cuda --cached_pickle output/cached_data.pickle --load_saved_model output/model.pth

You can expect a result that looks something like this: output
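
To make the flags in the inference command easier to read, the sketch below shows how a command-line interface with the same four options could be declared with argparse. Only the flag names come from the command above. The help texts, the defaults, and which options are required are assumptions, and the real inference.py may define them differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags used in the inference command."""
    parser = argparse.ArgumentParser(description="Run ReLIE inference on one image")
    parser.add_argument("--image", required=True, help="path to the input image")
    # Assumed to be a boolean switch, since the command passes it without a value.
    parser.add_argument("--cuda", action="store_true", help="run the model on a GPU")
    parser.add_argument(
        "--cached_pickle",
        default="output/cached_data.pickle",  # assumed default
        help="pickle file with cached data from training",
    )
    parser.add_argument(
        "--load_saved_model",
        default="output/model.pth",  # assumed default
        help="path to the trained model weights",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--image", "sample.jpg", "--cuda"])
    print(args)
```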

Citation

Representation Learning for Information Extraction from Form-like Documents

Bodhisattwa Prasad Majumder, Navneet Potti, Sandeep Tata, James B. Wendt, Qi Zhao, Marc Najork

Abstract
We propose a novel approach using representation learning for tackling the problem of extracting structured information from form-like document images. We propose an extraction system that uses knowledge of the types of the target fields to generate extraction candidates, and a neural network architecture that learns a dense representation of each candidate based on neighboring words in the document. These learned representations are not only useful in solving the extraction task for unseen document templates from two different domains, but are also interpretable, as we show using loss cases.

[Paper] [Google Blog]

@article{majumder2020representation,
  title={Representation Learning for Information Extraction from Form-like Documents},
  author={Bodhisattwa Prasad Majumder and Navneet Potti and Sandeep Tata and James B. Wendt and Qi Zhao and Marc Najork},
  journal={Association for Computational Linguistics},
  year={2020}
}