Praneet9 / Representation-Learning-for-Information-Extraction

License: Apache-2.0

PyTorch implementation of the Google Research paper "Representation Learning for Information Extraction from Form-like Documents".

ReLIE: Representation-Learning-for-Information-Extraction

This is an unofficial PyTorch implementation of Representation Learning for Information Extraction from Form-like Documents (ReLIE).

Model Architecture

[Figure: ReLIE model architecture diagram (image source)]

Getting Started

Clone the repository

git clone https://github.com/Praneet9/Representation-Learning-for-Information-Extraction.git

Create a virtualenv and install the required packages

pip install -r requirements.txt

Prepare dataset

STEP 1: Annotation

Create the dataset with an image annotation tool such as labelImg, which we used in this project, or any other tool that saves annotations as Pascal VOC XML files. To mark the true candidate for a required field, draw a bounding box around the word you want to extract. For our experiment, we annotated the following fields:

Invoice Number
Invoice Date
Total Amount
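
As a rough illustration of what the annotation files contain, the sketch below reads the labelled boxes out of a Pascal VOC XML document using only the standard library. The function name `read_voc_boxes`, the sample file name, and the label names (`invoice_number`, `total_amount`) are assumptions made for this example; the repository's own scripts may read the annotations differently.

```python
import xml.etree.ElementTree as ET


def read_voc_boxes(xml_text):
    """Return a list of (label, xmin, ymin, xmax, ymax) from Pascal VOC XML text."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bnd = obj.find("bndbox")
        # Coordinates are stored as text; some tools write them as floats.
        coords = tuple(
            int(float(bnd.findtext(tag))) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((label, *coords))
    return boxes


# Hypothetical annotation for one invoice image (label names are assumptions).
SAMPLE_XML = """
<annotation>
  <filename>invoice_001.jpg</filename>
  <size><width>1000</width><height>1400</height><depth>3</depth></size>
  <object>
    <name>invoice_number</name>
    <bndbox><xmin>720</xmin><ymin>95</ymin><xmax>860</xmax><ymax>120</ymax></bndbox>
  </object>
  <object>
    <name>total_amount</name>
    <bndbox><xmin>800</xmin><ymin>1210</ymin><xmax>905</xmax><ymax>1238</ymax></bndbox>
  </object>
</annotation>
"""

if __name__ == "__main__":
    for box in read_voc_boxes(SAMPLE_XML):
        print(box)
```

Each returned tuple marks the word that is the true candidate for its field, so it can later be matched against the OCR word boxes.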

STEP 2: Generate OCRs

Prerequisites:
We used Tesseract 4.0 to generate the OCR results.
You can install Tesseract from its official source here.
For best results, replace the default model with the LSTM model.
Download the LSTM models from here.
Once everything is set up, run the command below to generate the Tesseract results, which are saved in the tesseract_results_lstm directory.

$ python generate_tesseract_results.py
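
For readers who want to see what word-level OCR output looks like, here is a minimal sketch that turns the dictionary returned by pytesseract's `image_to_data` into a list of word boxes. This is not taken from generate_tesseract_results.py; the helper names `words_from_tesseract_data` and `ocr_image`, the output layout, and the `min_conf` threshold are assumptions for illustration. The OCR call itself needs pytesseract, Pillow, and the tesseract binary, so it is kept in a separate function.

```python
def words_from_tesseract_data(data, min_conf=0.0):
    """Convert a pytesseract image_to_data dict into a list of word boxes.

    Rows with empty text or a confidence below min_conf are dropped
    (Tesseract reports -1 for rows that are not words).
    """
    words = []
    for i, text in enumerate(data["text"]):
        text = text.strip()
        conf = float(data["conf"][i])
        if not text or conf < min_conf:
            continue
        left, top = data["left"][i], data["top"][i]
        words.append({
            "text": text,
            "x1": left,
            "y1": top,
            "x2": left + data["width"][i],
            "y2": top + data["height"][i],
            "conf": conf,
        })
    return words


def ocr_image(image_path):
    """Run Tesseract on one image and return its word boxes."""
    # Imported here so the helper above can be used without these packages.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return words_from_tesseract_data(data)


if __name__ == "__main__":
    # Hand-written stand-in for Tesseract output, so the demo runs without OCR.
    fake = {
        "text": ["", "Invoice", " ", "221233"],
        "conf": ["-1", "96", "-1", "91.5"],
        "left": [0, 10, 0, 120],
        "top": [0, 20, 0, 20],
        "width": [1000, 80, 0, 60],
        "height": [1400, 18, 0, 18],
    }
    for word in words_from_tesseract_data(fake):
        print(word)
```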

STEP 3: Extract Candidates

Modify extract_candidates.py based on your dataset and classes.

Invoice numbers: use regular expressions to extract the candidates for invoice number (e.g. 221233, 1041-553337)
Amounts: use regular expressions to extract the candidates for total amount (e.g. $222.32, $1200.44)
Dates: use dateparser to extract the candidates for invoice date

from dateparser.search import search_dates
# all_text: the OCR text of one document as a single string
search_dates(all_text)
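
The regular-expression side of candidate extraction could look roughly like the sketch below. The two patterns are assumptions tuned only to the example formats listed above (plain or hyphenated digit runs for invoice numbers, dollar amounts with optional thousands separators and cents); they are not the patterns used in extract_candidates.py, and real invoices will likely need broader ones.

```python
import re

# Assumed pattern: a run of 4+ digits, optionally followed by hyphenated digit
# groups, that is not part of a larger word, amount, or decimal number.
INVOICE_NO_RE = re.compile(r"(?<![\w$.,])\d{4,}(?:-\d+)*(?!\w|[.,]\d)")

# Assumed pattern: a dollar sign, digits with optional thousands separators,
# and optional two-digit cents.
AMOUNT_RE = re.compile(r"\$\s?\d+(?:,\d{3})*(?:\.\d{2})?")


def find_candidates(text):
    """Return candidate strings per field found in the OCR text."""
    return {
        "invoice_number": [m.group() for m in INVOICE_NO_RE.finditer(text)],
        "total_amount": [m.group() for m in AMOUNT_RE.finditer(text)],
    }


if __name__ == "__main__":
    sample = "Invoice No: 1041-553337 (ref 221233). Subtotal $222.32, Total: $1200.44"
    print(find_candidates(sample))
```

Candidate generation is meant to over-generate: every string with the right type is kept, and the trained model is then responsible for picking the correct candidate for each field.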

STEP 4: Define dataset split and update config

Split the dataset into training and validation sets.

Specify the dataset directory and split ratio in the utility script, then run:

python3 utils/prepare_split.py
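
To make the idea of a ratio-based split concrete, here is a small sketch of a reproducible train/validation split over a list of file names. It is not the contents of utils/prepare_split.py; the function name `split_dataset`, the default ratio of 0.2, and the seed are assumptions for illustration.

```python
import random


def split_dataset(file_names, val_ratio=0.2, seed=42):
    """Shuffle file names deterministically and return (train, val) lists."""
    # Sort first so the result does not depend on the input order.
    names = sorted(file_names)
    random.Random(seed).shuffle(names)
    n_val = int(round(len(names) * val_ratio))
    return names[n_val:], names[:n_val]


if __name__ == "__main__":
    # Hypothetical file names; in practice these would be listed from the dataset directory.
    images = [f"invoice_{i:03d}.jpg" for i in range(10)]
    train, val = split_dataset(images, val_ratio=0.2)
    print(len(train), "train /", len(val), "val")
```

Fixing the seed keeps the same documents in the validation set across runs, which makes training results comparable.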

Before running the training or evaluation script, modify the configuration to match your setup.

Train

python3 train.py

Evaluation

Coming Soon...

Inference

Get the inference results by running:

python3 inference.py --image sample.jpg --cuda --cached_pickle output/cached_data.pickle --load_saved_model output/model.pth

You can expect a result something like this: [example output image]

Citation

Representation Learning for Information Extraction from Form-like Documents

Bodhisattwa Prasad Majumder, Navneet Potti, Sandeep Tata, James B. Wendt, Qi Zhao, Marc Najork

Abstract
We propose a novel approach using representation learning for tackling the problem of extracting structured information from form-like document images. We propose an extraction system that uses knowledge of the types of the target fields to generate extraction candidates, and a neural network architecture that learns a dense representation of each candidate based on neighboring words in the document. These learned representations are not only useful in solving the extraction task for unseen document templates from two different domains, but are also interpretable, as we show using loss cases.

[Paper] [Google Blog]

@article{majumder2020representation,
  title={Representation Learning for Information Extraction from Form-like Documents},
  author={Bodhisattwa Prasad Majumder and Navneet Potti and Sandeep Tata and James B. Wendt and Qi Zhao and Marc Najork},
  journal={Association for Computational Linguistics},
  year={2020}
}
