Vision Transformer for Fast and Efficient Scene Text Recognition

Apr 19, 2022

Introduction

This paper introduces an efficient algorithm based on the ViT network topology for the scene text recognition (STR) task. Unlike other papers in this area, the authors focus not only on the accuracy metric but also investigate the trade-offs between accuracy, memory, and speed.
notion image
 
notion image
 
Here, the augmented models were trained with RandAugment: https://arxiv.org/pdf/1909.13719.pdf
The authors explain the need for a wide range of augmentations by the variety of samples in the benchmark datasets:
notion image
 
The base model used in this paper is a plain ViT, pre-trained with the knowledge distillation technique introduced in DeiT for better performance.
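The exact distillation setup is not detailed here, but DeiT's "hard" distillation variant can be sketched in a few lines: the student is trained against both the ground-truth label and the teacher's hard (argmax) prediction. A minimal NumPy sketch, with function names of our own choosing rather than the paper's:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, target_idx):
    # negative log-likelihood of the target class
    return -np.log(softmax(logits)[target_idx])

def hard_distillation_loss(student_logits, teacher_logits, true_label):
    # DeiT-style hard distillation: average the CE against the
    # ground-truth label and the CE against the teacher's argmax
    teacher_label = int(np.argmax(teacher_logits))
    return 0.5 * cross_entropy(student_logits, true_label) \
         + 0.5 * cross_entropy(student_logits, teacher_label)
```

In DeiT the teacher signal actually enters through a dedicated distillation token; the sketch above only illustrates the loss side of the recipe.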
 
In the related works section, the authors discuss the current SOTA in the text recognition area, mainly comparing the proposed model to multi-stage text recognizers:
notion image
 

Model Architecture

The model proposed in this paper (ViTSTR) consists of only one stage, the encoder. Since the task is text recognition, the input is assumed to be an already cropped text image:
The input image is then split into 16x16 patches, each reshaped into a 1D vector embedding.
The only difference between ViTSTR and the ViT model is the prediction head: instead of a single-class prediction head, ViTSTR predicts several characters in the correct sequence, order, and length.
To mark the start and end of the sequence, ViTSTR uses the [GO] and [s] tokens.
notion image
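The paper does not spell out the decoding step, but a greedy readout of the per-position predictions might look like the following sketch, assuming an argmax at each position and a stop at the first [s] token (the vocabulary here is hypothetical):

```python
import numpy as np

# hypothetical toy vocabulary: [GO] opens the sequence, [s] closes it
VOCAB = ["[GO]", "[s]", "a", "b", "c"]

def greedy_decode(logits):
    """logits: (seq_len, vocab_size) array of per-position scores.
    Take the argmax at each position and stop at the first [s]."""
    chars = []
    for step in logits:
        token = VOCAB[int(np.argmax(step))]
        if token == "[s]":
            break
        if token != "[GO]":
            chars.append(token)
    return "".join(chars)
```

For example, a sequence of positions whose argmaxes are [GO], a, b, [s] decodes to the string "ab", and everything after [s] is ignored.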
A constant embedding/feature size D is used in all layers. Input embeddings are converted to size D using a linear projection layer.
The topology displayed in the picture can be expressed with the following equations:
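The patching and linear projection steps can be sketched with plain NumPy; the image size, patch size, and width D below are illustrative (D=192 roughly matches a tiny configuration), not values taken from the paper:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an HxWxC image into non-overlapping patch x patch squares,
    each flattened into a 1D vector (as in ViT/ViTSTR)."""
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    patches = image[:rows * patch, :cols * patch].reshape(
        rows, patch, cols, patch, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch * patch * C)

rng = np.random.default_rng(0)
D = 192                                   # hypothetical model width
image = rng.standard_normal((224, 224, 3))
W_proj = rng.standard_normal((16 * 16 * 3, D))
tokens = patchify(image) @ W_proj         # shape: (196, D)
```

A 224x224x3 input thus becomes a sequence of 196 flattened patches of size 768, each projected down to the model width D before entering the encoder.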
 
notion image
notion image
notion image
 
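For reference, the equations in the figure follow the standard ViT encoder; a hedged reconstruction in the usual notation (E is the patch projection, E_pos the positional embedding, MSA multi-head self-attention, LN layer norm) would read:

```latex
z_0 = [x_{class};\, x_p^1 E;\, \dots;\, x_p^N E] + E_{pos}
z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1, \dots, L
z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \quad \ell = 1, \dots, L
y_i = \mathrm{Linear}(\mathrm{LN}(z_L^i))
```

The last line is where ViTSTR differs from plain ViT: instead of reading out a single class token, it applies the head to multiple output tokens to predict a character sequence.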
Different model configurations are presented in the following table:
notion image
 

Experiment results

Implementation and evaluation were carried out in the NAVER text recognition framework from "What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis".
Datasets used for training (both synthetic datasets):
  • MJSynth (MJ)
  • SynthText (ST)
Datasets used for testing (real image datasets):
  • ICDAR 03/13/15
  • SVT/SVTP
  • CT
  • IIIT5K
 
Training conditions:
notion image
 
A DeiT pre-trained model is used, with no parameters frozen during training.
 
Accuracy comparison:
notion image
 
Performance and tradeoff comparison:
notion image
 
Augmentation examples:
notion image
 
Failure cases:
notion image

Proposed model progress

ViTSTR is currently in 11th place in the global ranking of scene text recognition algorithms.
However, the ViTSTR results were later significantly improved in a follow-up work:
 
The results of that work show that text recognition performance can easily be improved by 4-5% by adding real data to the training set. There, adding the OpenImagesV5 dataset as 20% of the training split improved ViTSTR-Tiny accuracy on IIIT5K by 6.6%:
notion image
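The mixing scheme described above amounts to sampling training batches from two pools at a fixed ratio. A minimal sketch, assuming a simple per-batch quota (the function and the 20% default are ours, chosen to mirror the numbers in the text):

```python
import random

def mixed_batch(synthetic, real, batch_size=32, real_frac=0.2, seed=0):
    """Draw a training batch where roughly real_frac of the samples come
    from the real dataset (e.g. OpenImagesV5) and the rest from the
    synthetic pool (e.g. MJ/ST)."""
    rng = random.Random(seed)
    n_real = int(round(batch_size * real_frac))
    batch = rng.choices(real, k=n_real) \
          + rng.choices(synthetic, k=batch_size - n_real)
    rng.shuffle(batch)  # avoid a fixed synthetic/real ordering
    return batch
```

With `real_frac=0.2` and a batch of 32, each batch carries 6 real and 26 synthetic samples, matching the 20% split mentioned above.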
 

Conclusions

  • Even a one-stage recognizer can achieve quite high results in STR
  • ViT for the STR task is more efficient in terms of the accuracy/speed/memory trade-off
  • Real data in the training set is really important → worth a try!