Vision Transformer for Fast and Efficient Scene Text Recognition

Apr 19, 2022

Introduction

This paper introduces an efficient algorithm based on the ViT network topology for the scene text recognition (STR) task. Unlike other papers in this area, the authors focus not only on the accuracy metric but also investigate the trade-offs between accuracy, memory, and speed.
notion image
 
notion image
 
Here, the augmented models were trained with RandAugment: https://arxiv.org/pdf/1909.13719.pdf
The authors explain the need for a wide range of augmentations by the variety of samples in the benchmark datasets:
notion image
 
The base model used in this paper is a plain ViT, pre-trained with the knowledge distillation technique introduced in DeiT for better performance.
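The exact distillation setup is not detailed here, but DeiT's "hard" distillation variant can be sketched in a few lines: the student is trained against both the ground-truth label and the teacher's hard (argmax) prediction. A minimal NumPy sketch, with function names of our own choosing rather than the paper's:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, target_idx):
    # negative log-likelihood of the target class
    return -np.log(softmax(logits)[target_idx])

def hard_distillation_loss(student_logits, teacher_logits, true_label):
    # DeiT-style hard distillation: average the CE against the
    # ground-truth label and the CE against the teacher's argmax
    teacher_label = int(np.argmax(teacher_logits))
    return 0.5 * cross_entropy(student_logits, true_label) \
         + 0.5 * cross_entropy(student_logits, teacher_label)
```

In DeiT the teacher signal actually enters through a dedicated distillation token; the sketch above only illustrates the loss side of the recipe.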
 
In the related works section, the authors discuss the current SOTA in the text recognition area, mainly comparing the proposed model to multi-stage text recognizers:
notion image
 

Model Architecture

The model proposed in this paper (ViTSTR) consists of only one stage, the encoder. Since the task is text recognition, the input is assumed to be an already cropped text image:
The input image is then split into 16x16 patches, each reshaped into a 1D vector embedding.
The only difference between ViTSTR and the ViT model is the prediction head: instead of a single-class prediction head, ViTSTR predicts several characters in the correct sequence, order, and length.
To mark the start and end of the sequence, ViTSTR uses the [GO] and [s] tokens.
notion image
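The paper does not spell out the decoding step, but a greedy readout of the per-position predictions might look like the following sketch, assuming an argmax at each position and a stop at the first [s] token (the vocabulary here is hypothetical):

```python
import numpy as np

# hypothetical toy vocabulary: [GO] opens the sequence, [s] closes it
VOCAB = ["[GO]", "[s]", "a", "b", "c"]

def greedy_decode(logits):
    """logits: (seq_len, vocab_size) array of per-position scores.
    Take the argmax at each position and stop at the first [s]."""
    chars = []
    for step in logits:
        token = VOCAB[int(np.argmax(step))]
        if token == "[s]":
            break
        if token != "[GO]":
            chars.append(token)
    return "".join(chars)
```

For example, a sequence of positions whose argmaxes are [GO], a, b, [s] decodes to the string "ab", and everything after [s] is ignored.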
A constant embedding/feature size D is used in all layers. Input embeddings are converted to size D using a linear projection layer.
The topology displayed in the picture can be expressed with the following equations:
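The patching and linear projection steps can be sketched with plain NumPy; the image size, patch size, and width D below are illustrative (D=192 roughly matches a tiny configuration), not values taken from the paper:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an HxWxC image into non-overlapping patch x patch squares,
    each flattened into a 1D vector (as in ViT/ViTSTR)."""
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    patches = image[:rows * patch, :cols * patch].reshape(
        rows, patch, cols, patch, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch * patch * C)

rng = np.random.default_rng(0)
D = 192                                   # hypothetical model width
image = rng.standard_normal((224, 224, 3))
W_proj = rng.standard_normal((16 * 16 * 3, D))
tokens = patchify(image) @ W_proj         # shape: (196, D)
```

A 224x224x3 input thus becomes a sequence of 196 flattened patches of size 768, each projected down to the model width D before entering the encoder.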
 
notion image
notion image
notion image
 
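For reference, the equations in the figure follow the standard ViT encoder; a hedged reconstruction in the usual notation (E is the patch projection, E_pos the positional embedding, MSA multi-head self-attention, LN layer norm) would read:

```latex
z_0 = [x_{class};\, x_p^1 E;\, \dots;\, x_p^N E] + E_{pos}
z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1, \dots, L
z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \quad \ell = 1, \dots, L
y_i = \mathrm{Linear}(\mathrm{LN}(z_L^i))
```

The last line is where ViTSTR differs from plain ViT: instead of reading out a single class token, it applies the head to multiple output tokens to predict a character sequence.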
Different model configurations are presented in the following table:
notion image
 

Experiment results

Implementation and evaluation were carried out in the NAVER text recognition framework from "What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis".
Datasets used for training (both synthetic datasets):
  • MJSynth (MJ)
  • SynthText (ST)
Datasets used for testing (real image datasets):
  • ICDAR 03/13/15
  • SVT/SVTP
  • CT
  • IIIT5K
 
Training conditions:
notion image
 
A DeiT pre-trained model is used, with no parameters frozen during training.
 
Accuracy comparison:
notion image
 
Performance and tradeoff comparison:
notion image
 
Augmentation examples:
notion image
 
Failure cases:
notion image

Proposed model progress

ViTSTR is currently in 11th place in the global ranking of scene text recognition algorithms.
However, the ViTSTR results were later significantly improved in a follow-up work:
 
The results of that work show that text recognition performance can easily be improved by 4-5% by adding real data to the training set. There, adding the OpenImagesV5 dataset as 20% of the training split improved ViTSTR-Tiny accuracy on IIIT5K by 6.6%:
notion image
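The mixing scheme described above amounts to sampling training batches from two pools at a fixed ratio. A minimal sketch, assuming a simple per-batch quota (the function and the 20% default are ours, chosen to mirror the numbers in the text):

```python
import random

def mixed_batch(synthetic, real, batch_size=32, real_frac=0.2, seed=0):
    """Draw a training batch where roughly real_frac of the samples come
    from the real dataset (e.g. OpenImagesV5) and the rest from the
    synthetic pool (e.g. MJ/ST)."""
    rng = random.Random(seed)
    n_real = int(round(batch_size * real_frac))
    batch = rng.choices(real, k=n_real) \
          + rng.choices(synthetic, k=batch_size - n_real)
    rng.shuffle(batch)  # avoid a fixed synthetic/real ordering
    return batch
```

With `real_frac=0.2` and a batch of 32, each batch carries 6 real and 26 synthetic samples, matching the 20% split mentioned above.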
 

Conclusions

  • Even a one-stage recognizer can achieve quite high results in STR
  • ViT for the STR task is more efficient in terms of the accuracy/speed/memory trade-off
  • Real data in the training set is really important → worth a try!