
yashkant / sam-textvqa

Licence: other
Official code for the paper "Spatially Aware Multimodal Transformers for TextVQA", published at ECCV 2020.

Programming Languages

python
139335 projects - #7 most used programming language
c
50402 projects - #5 most used programming language
shell
77523 projects

Projects that are alternatives of or similar to sam-textvqa

Openkai
OpenKAI: A modern framework for unmanned vehicle and robot control
Stars: ✭ 150 (+194.12%)
Mutual labels:  vision
Arc Robot Vision
MIT-Princeton Vision Toolbox for Robotic Pick-and-Place at the Amazon Robotics Challenge 2017 - Robotic Grasping and One-shot Recognition of Novel Objects with Deep Learning.
Stars: ✭ 224 (+339.22%)
Mutual labels:  vision
nested-transformer
Nested Hierarchical Transformer https://arxiv.org/pdf/2105.12723.pdf
Stars: ✭ 174 (+241.18%)
Mutual labels:  vision
Apriltag ros
A ROS wrapper of the AprilTag 3 visual fiducial detector
Stars: ✭ 160 (+213.73%)
Mutual labels:  vision
React Native Text Detector
Text detector from images for React Native, using Firebase ML Kit on Android and Tesseract on iOS
Stars: ✭ 194 (+280.39%)
Mutual labels:  vision
Amazing Arkit
A curated collection of ARKit-related resources. Group: 326705018
Stars: ✭ 239 (+368.63%)
Mutual labels:  vision
Robotcar Dataset Sdk
Software Development Kit for the Oxford Robotcar Dataset
Stars: ✭ 151 (+196.08%)
Mutual labels:  vision
autonomous-delivery-robot
Repository for Autonomous Delivery Robot project of IvLabs, VNIT
Stars: ✭ 65 (+27.45%)
Mutual labels:  vision
Simplecv
Stars: ✭ 2,522 (+4845.1%)
Mutual labels:  vision
Learnable-Image-Resizing
TF 2 implementation Learning to Resize Images for Computer Vision Tasks (https://arxiv.org/abs/2103.09950v1).
Stars: ✭ 48 (-5.88%)
Mutual labels:  vision
Attendance Using Face
Face-recognition using Siamese network
Stars: ✭ 174 (+241.18%)
Mutual labels:  vision
Opticalflow visualization
Python optical flow visualization following Baker et al. (ICCV 2007) as used by the MPI-Sintel challenge
Stars: ✭ 183 (+258.82%)
Mutual labels:  vision
Opencv
📷 Computer-Vision Demos
Stars: ✭ 244 (+378.43%)
Mutual labels:  vision
Arucogen
Online ArUco markers generator
Stars: ✭ 155 (+203.92%)
Mutual labels:  vision
Grocery-Product-Detection
This repository builds a product detection model to recognize products from grocery shelf images.
Stars: ✭ 73 (+43.14%)
Mutual labels:  vision
Nextlevel
NextLevel was initially a weekend project that has now grown into an open community of camera platform enthusiasts. The software provides foundational components for managing media recording, camera interface customization, gestural interaction customization, and image streaming on iOS. The same capabilities can also be found in apps such as Snapchat, Instagram, and Vine.
Stars: ✭ 1,940 (+3703.92%)
Mutual labels:  vision
Cs231a Notes
The course notes for Stanford's CS231A course on computer vision
Stars: ✭ 230 (+350.98%)
Mutual labels:  vision
pybv
A lightweight I/O utility for the BrainVision data format, written in Python.
Stars: ✭ 18 (-64.71%)
Mutual labels:  vision
frc-score-detection
A program to detect FRC match scores from their livestream.
Stars: ✭ 15 (-70.59%)
Mutual labels:  vision
Expression-manipulator
Code for the ECCV'20 paper 'Toward Fine-grained Facial Expression Manipulation'
Stars: ✭ 71 (+39.22%)
Mutual labels:  eccv

Spatially Aware Multimodal Transformers for TextVQA

Yash Kant, Dhruv Batra, Peter Anderson, Alex Schwing, Devi Parikh, Jiasen Lu, Harsh Agrawal
Published at ECCV, 2020


Paper: arxiv.org/abs/2007.12146

Project Page: yashkant.github.io/projects/sam-textvqa

We propose a novel spatially aware self-attention layer in which each visual entity attends only to neighboring entities defined by a spatial graph, and we use it to solve TextVQA.
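
A minimal sketch of the idea (illustrative only, not the authors' implementation): standard scaled dot-product attention whose scores are masked by a spatial adjacency matrix, so each entity attends only to its graph neighbors. All names, shapes, and the toy adjacency below are assumptions.

import torch
import torch.nn.functional as F

def spatially_masked_attention(q, k, v, adjacency):
    # q, k, v: (batch, entities, dim); adjacency: (batch, entities, entities),
    # where adjacency[b, i, j] = 1 iff entity j is a spatial neighbor of entity i.
    scores = torch.matmul(q, k.transpose(-2, -1)) / q.size(-1) ** 0.5
    # Non-neighbors are masked out before the softmax, so attention weight
    # is distributed only over entities connected in the spatial graph.
    scores = scores.masked_fill(adjacency == 0, float("-inf"))
    return torch.matmul(F.softmax(scores, dim=-1), v)

# Toy usage: 2 samples, 5 entities, 64-dim features, self-loops only.
q = k = v = torch.randn(2, 5, 64)
adj = torch.eye(5).expand(2, 5, 5)
print(spatially_masked_attention(q, k, v, adj).shape)  # torch.Size([2, 5, 64])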

Repository Setup

Create a fresh conda environment, and install all dependencies.

conda create -n sam python=3.6
conda activate sam
cd sam-textvqa
pip install -r requirements.txt

Install pytorch

conda install pytorch torchvision cudatoolkit=10.0 -c pytorch

Finally, install apex from: https://github.com/NVIDIA/apex
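
Optionally, a quick sanity check (our suggestion, not part of the original instructions) that the core dependencies are importable:

import torch

print(torch.__version__)          # expect a build matching cudatoolkit 10.0
print(torch.cuda.is_available())  # should print True on a GPU machine

try:
    from apex import amp  # noqa: F401 -- apex supplies mixed-precision utilities
    print("apex OK")
except ImportError:
    print("apex missing - install it from https://github.com/NVIDIA/apex")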

Data Setup

Download the files from the Dropbox link and place them in the data/ folder. Ensure that the data paths match the directory structure provided in data/README.md.
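
Before training, it can help to compare the unpacked layout against data/README.md; below is a small throwaway script (ours, not part of the repo) that prints the top two levels of data/:

from pathlib import Path

root = Path("data")
if not root.is_dir():
    raise SystemExit("data/ not found - download the Dropbox files first")
# Print the top two levels so they can be checked against data/README.md.
for entry in sorted(root.iterdir()):
    print(entry)
    if entry.is_dir():
        for child in sorted(entry.iterdir()):
            print("  ", child)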

Run Experiments

Pick a suitable configuration file from the table below:

Method    Context (c)    Train Splits       Evaluation Splits    Config File
SA-M4C    3              TextVQA            TextVQA              train-tvqa-eval-tvqa-c3.yml
SA-M4C    3              TextVQA + STVQA    TextVQA              train-tvqa_stvqa-eval-tvqa-c3.yml
SA-M4C    3              STVQA              STVQA                train-stvqa-eval-stvqa-c3.yml
SA-M4C    5              TextVQA            TextVQA              train-tvqa-eval-tvqa-c5.yml
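
To see what a run will actually do, you can inspect the chosen YAML file first; a minimal sketch assuming the configs live under configs/ (as in the evaluation command below) and that PyYAML is available:

import yaml

# Load one of the configs from the table and list its top-level options.
with open("configs/train-tvqa-eval-tvqa-c3.yml") as f:
    cfg = yaml.safe_load(f)
print(sorted(cfg))  # exact keys depend on the repo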

To run the experiments, use:

python train.py \
--config config.yml \
--tag experiment-name

To evaluate the provided pretrained checkpoint, use:

python train.py \
--config configs/train-tvqa_stvqa-eval-tvqa-c3.yml \
--pretrained_eval data/pretrained-models/best_model.tar

Note: The beam-search evaluation is undergoing changes and will be updated.

Resources Used: We ran all the experiments on 2 Titan Xp GPUs.

Citation

@inproceedings{kant2020spatially,
  title={Spatially Aware Multimodal Transformers for TextVQA},
  author={Kant, Yash and Batra, Dhruv and Anderson, Peter 
          and Schwing, Alexander and Parikh, Devi and Lu, Jiasen
          and Agrawal, Harsh},
  booktitle={ECCV},
  year={2020}}

Acknowledgements

Parts of this codebase were borrowed from other open-source repositories.

We thank Abhishek Das and Abhinav Moudgil for their feedback, and Ronghang Hu for sharing an early version of his work. The Georgia Tech effort was supported in part by NSF, AFRL, DARPA, ONR YIPs, ARO PECASE, and Amazon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government or any sponsor.

License

MIT
