roatienza / deep-text-recognition-benchmark

License: Apache-2.0
PyTorch code of my ICDAR 2021 paper Vision Transformer for Fast and Efficient Scene Text Recognition (ViTSTR)

Programming Languages

Jupyter Notebook
Python

Projects that are alternatives of or similar to deep-text-recognition-benchmark

LaTeX-OCR
pix2tex: Using a ViT to convert images of equations into LaTeX code.
Stars: ✭ 1,566 (+1173.17%)
Mutual labels:  ocr, vision-transformer
ibm-cloud-functions-serverless-ocr-openchecks
Serverless bank check deposit processing with object storage and optical character recognition using Apache OpenWhisk powered by IBM Cloud Functions. See the Tech Talk replay for a demo.
Stars: ✭ 40 (-67.48%)
Mutual labels:  ocr
gazou
Japanese OCR for Linux & Windows
Stars: ✭ 32 (-73.98%)
Mutual labels:  ocr
GFNet
[NeurIPS 2021] Global Filter Networks for Image Classification
Stars: ✭ 199 (+61.79%)
Mutual labels:  vision-transformer
R2CNN
caffe re-implementation of R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection
Stars: ✭ 80 (-34.96%)
Mutual labels:  ocr
craft-text-detector
Packaged, PyTorch-based, easy-to-use, cross-platform version of the CRAFT text detector
Stars: ✭ 151 (+22.76%)
Mutual labels:  ocr
MLKit
🌝 MLKit is a powerful and easy-to-use toolkit. With ML Kit you can easily implement text recognition, barcode scanning, image labeling, face detection, object detection, and more.
Stars: ✭ 294 (+139.02%)
Mutual labels:  ocr
go-ocr
A tool for extracting text from scanned documents (via OCR), with user-defined post-processing.
Stars: ✭ 31 (-74.8%)
Mutual labels:  ocr
Splice
Official Pytorch Implementation for "Splicing ViT Features for Semantic Appearance Transfer" presenting "Splice" (CVPR 2022)
Stars: ✭ 126 (+2.44%)
Mutual labels:  vision-transformer
cordova-plugin-tesseract
Cordova Plugin for OCR process using Tesseract
Stars: ✭ 70 (-43.09%)
Mutual labels:  ocr
ScreencapToTextBot
Reddit bot that takes a screencap of a conversation and converts it into Reddit-formatted text
Stars: ✭ 12 (-90.24%)
Mutual labels:  ocr
pdf-scripts
📑 Scripts to repair, verify, OCR, compress, wrangle, crop (etc.) PDFs
Stars: ✭ 33 (-73.17%)
Mutual labels:  ocr
proxy-scrape
scrapin' proxies with ocr
Stars: ✭ 20 (-83.74%)
Mutual labels:  ocr
MPViT
MPViT: Multi-Path Vision Transformer for Dense Prediction (CVPR 2022)
Stars: ✭ 193 (+56.91%)
Mutual labels:  vision-transformer
doctr-tfjs-demo
Javascript demo of docTR, powered by TensorFlowJS
Stars: ✭ 21 (-82.93%)
Mutual labels:  ocr
PASSL
PASSL includes self-supervised image algorithms such as SimCLR, MoCo v1/v2, BYOL, CLIP, PixPro, BEiT, and MAE, as well as foundational vision algorithms such as Vision Transformer, DeiT, Swin Transformer, CvT, T2T-ViT, MLP-Mixer, XCiT, ConvNeXt, and PVTv2.
Stars: ✭ 134 (+8.94%)
Mutual labels:  vision-transformer
Combining-EfficientNet-and-Vision-Transformers-for-Video-Deepfake-Detection
Code for the video deepfake detection model from "Combining EfficientNet and Vision Transformers for Video Deepfake Detection", available on arXiv and submitted to ICIAP 2021.
Stars: ✭ 39 (-68.29%)
Mutual labels:  vision-transformer
lookup
🔍 Pure Go implementation of fast image search and simple OCR, focused on reading info from screenshots
Stars: ✭ 35 (-71.54%)
Mutual labels:  ocr
granblue-automation-android
Educational application written in Kotlin aimed at automating user-defined workflows for the mobile game, "Granblue Fantasy", using MediaProjection, AccessibilityService, and OpenCV.
Stars: ✭ 26 (-78.86%)
Mutual labels:  ocr
python-ocr-example
The code for the blogpost A Python Approach to Character Recognition
Stars: ✭ 54 (-56.1%)
Mutual labels:  ocr

Vision Transformer for Fast and Efficient Scene Text Recognition

ViTSTR is a simple single-stage model that uses a pre-trained Vision Transformer (ViT) to perform Scene Text Recognition (hence ViTSTR). Its accuracy is comparable to state-of-the-art STR models, while it uses significantly fewer parameters and FLOPS. ViTSTR is also fast, thanks to the parallel computation inherent in the ViT architecture.
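To make the single-stage design concrete, below is a minimal PyTorch sketch of the idea, not the repo's implementation (which builds on timm and adds pre-trained weights and other details). Sizes roughly follow ViTSTR-Tiny; num_chars and max_len are illustrative placeholders.

import torch
import torch.nn as nn

class ViTSTRSketch(nn.Module):
    """Toy single-stage recognizer: ViT encoder plus a shared per-token head."""
    def __init__(self, img_size=224, patch=16, dim=192, depth=12, heads=3,
                 num_chars=96, max_len=25):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_chars)  # shared across token positions
        self.max_len = max_len

    def forward(self, x):  # x: (B, 1, 224, 224) grayscale crops
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, 196, dim)
        tokens = self.encoder(tokens + self.pos_embed)
        # All positions are classified at once; no recurrent decoding step.
        return self.head(tokens[:, :self.max_len])  # (B, max_len, num_chars)

logits = ViTSTRSketch()(torch.randn(2, 1, 224, 224))
print(logits.shape)  # torch.Size([2, 25, 96])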

Paper

ViTSTR Model

ViTSTR is built on a fork of the CLOVA AI Deep Text Recognition Benchmark. Below we document how to train and evaluate ViTSTR-Tiny and ViTSTR-Small.

Install requirements

pip3 install -r requirements.txt

Inference

python3 infer.py --image demo_image/demo_1.png \
--model https://github.com/roatienza/deep-text-recognition-benchmark/releases/download/v0.1.0/vitstr_small_patch16_224_aug_infer.pth

Replace --image with the path to your target image file.

After the model has been downloaded, you can perform inference using the local checkpoint:

python3 infer.py --image demo_image/demo_2.jpg --model vitstr_small_patch16_224_aug_infer.pth
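For use inside a Python program, the steps performed by infer.py can be approximated as below. This is a hedged sketch, not the script itself: it assumes the *_infer.pth checkpoint is a whole pickled model (so the repo's modules must be importable), that the input is a 224x224 grayscale crop, and that decoding is a greedy per-position argmax; check infer.py for the exact forward signature and the converter that maps indices back to characters.

import torch
from PIL import Image
from torchvision import transforms

# On PyTorch >= 2.6, torch.load may additionally need weights_only=False.
model = torch.load("vitstr_small_patch16_224_aug_infer.pth", map_location="cpu")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
img = Image.open("demo_image/demo_1.png").convert("L")  # grayscale, as in training
x = preprocess(img).unsqueeze(0)  # (1, 1, 224, 224)

with torch.no_grad():
    logits = model(x)  # (1, max_len, num_classes)
indices = logits.argmax(-1)[0].tolist()  # greedy decode: one argmax per position
# Map indices back to characters with the repo's converter, trimming at the
# '[s]' end-of-sequence token.
print(indices)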

Sample Results:

Input Image   Output Prediction
demo_1        Available
demo_2        SHAKESHACK
demo_3        Londen
demo_4        Greenstead

Dataset

Download lmdb dataset from CLOVA AI Deep Text Recognition Benchmark.
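To sanity-check a downloaded split before training, the lmdb files can be read directly. A reader sketch (not code from the repo), assuming the CLOVA lmdb layout of a num-samples key plus image-%09d / label-%09d entries indexed from 1; adjust the split path as needed:

import io
import lmdb
from PIL import Image

env = lmdb.open("data_lmdb_release/evaluation/IIIT5k_3000",
                readonly=True, lock=False)
with env.begin() as txn:
    n = int(txn.get(b"num-samples"))
    print("samples:", n)
    # Fetch the first (image, label) pair.
    label = txn.get(b"label-%09d" % 1).decode()
    img = Image.open(io.BytesIO(txn.get(b"image-%09d" % 1)))
    print(label, img.size)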

Quick validation using a pre-trained model

ViTSTR-Small

CUDA_VISIBLE_DEVICES=0 python3 test.py --eval_data data_lmdb_release/evaluation \
--benchmark_all_eval --Transformation None --FeatureExtraction None \
--SequenceModeling None --Prediction None --Transformer \
--sensitive --data_filtering_off  --imgH 224 --imgW 224 \
--TransformerModel=vitstr_small_patch16_224 \
--saved_model https://github.com/roatienza/deep-text-recognition-benchmark/releases/download/v0.1.0/vitstr_small_patch16_224_aug.pth

Available model weights:

Tiny                     Small                     Base
vitstr_tiny_patch16_224  vitstr_small_patch16_224  vitstr_base_patch16_224
ViTSTR-Tiny              ViTSTR-Small              ViTSTR-Base
ViTSTR-Tiny+Aug          ViTSTR-Small+Aug          ViTSTR-Base+Aug
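The release URL pattern for these weights follows the quick-validation command above. Below is a hedged sketch for fetching a checkpoint programmatically with torch.hub's download helper; the training checkpoints and the *_infer.pth files may differ in content (a state_dict versus a whole pickled model), so inspect the result before loading it into a model:

import torch

url = ("https://github.com/roatienza/deep-text-recognition-benchmark/"
       "releases/download/v0.1.0/vitstr_small_patch16_224_aug.pth")
# Downloads to the torch hub cache and deserializes with torch.load.
ckpt = torch.hub.load_state_dict_from_url(url, map_location="cpu")
print(type(ckpt))  # e.g., an OrderedDict of weights, possibly with 'module.' prefixes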

Benchmarks (top-1 accuracy, %)

Model              IIIT  SVT   IC03  IC03  IC13  IC13  IC15  IC15  SVTP  CT    Acc   Std
                   3000  647   860   867   857   1015  1811  2077  645   288   %     %
TRBA (Baseline)    87.7  87.4  94.5  94.2  93.4  92.1  77.3  71.6  78.1  75.5  84.3  0.1
ViTSTR-Tiny        83.7  83.2  92.8  92.5  90.8  89.3  72.0  66.4  74.5  65.0  80.3  0.2
ViTSTR-Tiny+Aug    85.1  85.0  93.4  93.2  90.9  89.7  74.7  68.9  78.3  74.2  82.1  0.1
ViTSTR-Small       85.6  85.3  93.9  93.6  91.7  90.6  75.3  69.5  78.1  71.3  82.6  0.3
ViTSTR-Small+Aug   86.6  87.3  94.2  94.2  92.1  91.2  77.9  71.7  81.4  77.9  84.2  0.1
ViTSTR-Base        86.9  87.2  93.8  93.4  92.1  91.3  76.8  71.1  80.0  74.7  83.7  0.1
ViTSTR-Base+Aug    88.4  87.7  94.7  94.3  93.2  92.4  78.5  72.6  81.8  81.3  85.2  0.1

Comparison with other STR models

Accuracy vs Number of Parameters

[figure]

Accuracy vs Speed (2080Ti GPU)

[figure]

Accuracy vs FLOPS

[figure]

Train

ViTSTR-Tiny without data augmentation

RANDOM=$$   # seed bash's $RANDOM generator with the shell's PID

CUDA_VISIBLE_DEVICES=0 python3 train.py --train_data data_lmdb_release/training \
--valid_data data_lmdb_release/evaluation --select_data MJ-ST \
--batch_ratio 0.5-0.5 --Transformation None --FeatureExtraction None \
--SequenceModeling None --Prediction None --Transformer \
--TransformerModel=vitstr_tiny_patch16_224 --imgH 224 --imgW 224 \
--manualSeed=$RANDOM  --sensitive

Multi-GPU training

ViTSTR-Small on a 4-GPU machine

It is recommended to train larger networks like ViTSTR-Small and ViTSTR-Base on a multi-GPU machine. To keep the total batch size fixed at 192, divide 192 by the number of GPUs and pass the result via the --batch_size option. For example, to train ViTSTR-Small on a 4-GPU machine, use --batch_size=48.

python3 train.py --train_data data_lmdb_release/training \
--valid_data data_lmdb_release/evaluation --select_data MJ-ST \
--batch_ratio 0.5-0.5 --Transformation None --FeatureExtraction None \
--SequenceModeling None --Prediction None --Transformer \
--TransformerModel=vitstr_small_patch16_224 --imgH 224 --imgW 224 \
--manualSeed=$RANDOM --sensitive --batch_size=48
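If your launch script runs on machines with different GPU counts, the per-GPU batch size can be derived instead of hard-coded. A small, hypothetical helper (assumes the total of 192 divides evenly by the GPU count):

import torch

TOTAL_BATCH = 192  # effective batch size the recipe above keeps fixed
n_gpus = max(torch.cuda.device_count(), 1)
assert TOTAL_BATCH % n_gpus == 0, "choose a GPU count that divides 192"
print(f"--batch_size={TOTAL_BATCH // n_gpus}")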

Data augmentation

ViTSTR-Tiny using rand augment

It is recommended to use more workers (e.g., 32 instead of the default 4) since the data augmentation process is CPU-intensive. A simple rule of thumb is to set the number of workers to between 25% and 50% of the total number of CPU cores; for example, on a system with 64 CPU cores, 32 workers uses 50% of all cores. On multi-GPU systems, divide the number of workers by the number of GPUs: for 32 workers on a 4-GPU system, use --workers=8. For convenience, pass --workers=-1 and 50% of all cores will be used. Lastly, instead of a constant learning rate, a cosine scheduler (--scheduler) improves the performance of the model during training.
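The worker rule of thumb can also be computed in a launch script rather than by hand. A small, hypothetical helper (mirroring the 50%-of-cores behavior of --workers=-1):

import os
import torch

cores = os.cpu_count() or 1
n_gpus = max(torch.cuda.device_count(), 1)
workers_total = cores // 2  # 50% of all cores, as with --workers=-1
print(f"--workers={max(workers_total // n_gpus, 1)}")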

Below is a sample configuration for a 4-GPU system with a total batch size of 192.

python3 train.py --train_data data_lmdb_release/training \
--valid_data data_lmdb_release/evaluation --select_data MJ-ST \
--batch_ratio 0.5-0.5 --Transformation None --FeatureExtraction None \
--SequenceModeling None --Prediction None --Transformer \
--TransformerModel=vitstr_tiny_patch16_224 --imgH 224 --imgW 224 \
--manualSeed=$RANDOM  --sensitive \
--batch_size=48 --isrand_aug --workers=-1 --scheduler

Test

ViTSTR-Tiny. Find the path to the best_accuracy.pth checkpoint file (usually in the saved_model folder).

CUDA_VISIBLE_DEVICES=0 python3 test.py --eval_data data_lmdb_release/evaluation \
--benchmark_all_eval --Transformation None --FeatureExtraction None \
--SequenceModeling None --Prediction None --Transformer \
--TransformerModel=vitstr_tiny_patch16_224 \
--sensitive --data_filtering_off  --imgH 224 --imgW 224 \
--saved_model <path_to/best_accuracy.pth>

Citation

If you find this work useful, please cite:

@inproceedings{atienza2021vision,
  title={Vision transformer for fast and efficient scene text recognition},
  author={Atienza, Rowel},
  booktitle={International Conference on Document Analysis and Recognition},
  pages={319--334},
  year={2021},
  organization={Springer}
}