Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features
This paper introduces a novel method, called Multi-modAl Text Recognition Network (MATRN), that enables interactions between visual and semantic features for better recognition performance.
Apr 20, 2022
Introduction
MATRN is a scene text recognition (STR) method that enables interactions between visual and semantic features to improve recognition performance.
The paper's main contributions are claimed to be:
- We explore the combinations of visual and semantic features, identified by the VM and LM, and prove their benefits. To the best of our knowledge, multi-modal feature enhancements with bi-directional fusions are novel components that have never been explored.
- We propose a new STR method, named MATRN, that contains three major components: spatial encoding for semantics, multi-modal feature enhancements, and a visual clue masking strategy, for better combinations of the two modalities. Thanks to the effective contributions of the proposed components, MATRN achieves state-of-the-art performance on seven STR benchmarks.
- We provide empirical analyses that illustrate how our components improve STR performance as well as how MATRN addresses the existing challenges.
The development of STR algorithms can be represented by a few approaches:
- Models like ViTSTR mainly focused on the visual feature processing module, without explicitly modeling and training a language module.
- Later approaches like SRN and ABINet utilize a separate language model and subsequently fuse visual and semantic features for the final sequence prediction.
- Approaches like Bhunia et al. proposed a multi-stage decoder that refers to visual features multiple times to enhance semantic features.
- Finally, VisionLAN proposed a language-aware visual mask that refers to semantic features to enhance the visual features. Given a masked character position in the word, the masking module occludes the corresponding visual feature maps of that character region during training.
- Inspired by Bhunia et al. [3] and VisionLAN [25], the authors explore multiple combinations of multi-modal processes and propose a framework that bidirectionally refines visual and semantic features by referring to each other.
MATRN Architecture
MATRN network topology:
Visual feature extraction consists of:
- ResNet with 45 layers
- Transformer units
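A rough sketch of this backbone is given below, assuming a generic CNN stand-in for the 45-layer ResNet followed by standard Transformer encoder layers (layer counts and dimensions are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class VisualFeatureExtractor(nn.Module):
    """CNN backbone + Transformer encoder producing a grid of visual features."""
    def __init__(self, d_model=512, n_layers=3, n_heads=8):
        super().__init__()
        # Stand-in for the 45-layer ResNet used in the paper; any CNN that
        # outputs a (B, d_model, H', W') feature map works for this sketch.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.BatchNorm2d(d_model), nn.ReLU(),
        )
        encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

    def forward(self, images):                         # images: (B, 3, H, W)
        fmap = self.cnn(images)                        # (B, d_model, H', W')
        b, c, h, w = fmap.shape
        tokens = fmap.flatten(2).transpose(1, 2)       # (B, H'*W', d_model)
        return self.transformer(tokens), (h, w)        # contextualized visual features
```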
In the next step, seed text generation transcribes the visual features into a seed text for the language model:
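A hedged sketch of this step, assuming ABINet-style positional attention over the visual features followed by a linear character classifier (class names and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class SeedTextGenerator(nn.Module):
    """Attends to visual features with learned positional queries and
    predicts one character distribution per text position."""
    def __init__(self, d_model=512, max_len=26, n_classes=37):
        super().__init__()
        self.pos_queries = nn.Parameter(torch.randn(max_len, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, visual_feats):                   # (B, H*W, d_model)
        q = self.pos_queries.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        sem, attn_map = self.attn(q, visual_feats, visual_feats)  # attn_map: (B, T, H*W)
        logits = self.classifier(sem)                  # (B, T, n_classes)
        return logits, attn_map
```

The visual-to-semantic attention map returned here is reused further down, both for spatial encoding and for visual clue masking.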
The language model consists of 4 Transformer decoder blocks and is initialized with the weights pre-trained on WikiText-103, as in ABINet:
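As a simplified sketch (not the exact cloze-style architecture of ABINet's language model), the four decoder blocks can be approximated with standard Transformer decoder layers whose queries are positional embeddings and whose memory is the embedded seed text; in practice the weights would be loaded from ABINet's WikiText-103 pre-training:

```python
import torch
import torch.nn as nn

class LanguageModel(nn.Module):
    """Four Transformer decoder blocks refining the seed text into semantic features."""
    def __init__(self, d_model=512, n_classes=37, max_len=26, n_layers=4, n_heads=8):
        super().__init__()
        self.char_embed = nn.Linear(n_classes, d_model)   # embeds seed-text probabilities
        self.pos_embed = nn.Parameter(torch.randn(max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, seed_probs):                     # softmaxed seed logits, (B, T, n_classes)
        memory = self.char_embed(seed_probs)           # (B, T, d_model)
        q = self.pos_embed.unsqueeze(0).expand(memory.size(0), -1, -1)
        return self.decoder(q, memory)                 # semantic features, (B, T, d_model)
```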
The language model output is then fused with Spatial Encoding to Semantics (SES):
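One plausible reading of SES, sketched below under that assumption, is that each character's semantic feature is tagged with an attention-weighted average of the 2D positional encodings of the visual locations it came from; the exact formulation in the paper may differ:

```python
import torch

def spatial_encoding_to_semantics(sem_feats, attn_map, pos_encoding):
    """
    sem_feats:    (B, T, D)  semantic features from the language model
    attn_map:     (B, T, HW) visual-to-semantic attention from seed text generation
    pos_encoding: (HW, D)    2D positional encoding of the visual feature grid
    Each character's semantic feature is augmented with an attention-weighted
    average of the spatial positions it was read from.
    """
    spatial_hint = torch.einsum("bts,sd->btd", attn_map, pos_encoding)
    return sem_feats + spatial_hint
```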
After the visual features (with visual clue masking applied) and the enhanced semantic features are processed separately, they are sent to the multi-modal feature enhancement module:
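A minimal sketch of bidirectional enhancement, assuming plain cross-attention in both directions (visual features attend to semantic features and vice versa); the paper's module may differ in detail:

```python
import torch
import torch.nn as nn

class MultiModalFeatureEnhancement(nn.Module):
    """Bidirectional cross-attention: each modality is refined using the other."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.vis_from_sem = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sem_from_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(d_model)
        self.norm_s = nn.LayerNorm(d_model)

    def forward(self, vis, sem):                       # vis: (B, HW, D), sem: (B, T, D)
        v_upd, _ = self.vis_from_sem(vis, sem, sem)    # visual queries, semantic keys/values
        s_upd, _ = self.sem_from_vis(sem, vis, vis)    # semantic queries, visual keys/values
        return self.norm_v(vis + v_upd), self.norm_s(sem + s_upd)
```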
At the last stage, the enhanced visual and semantic features are fused into the final output sequence. Here, the visual features are first transcribed into a semantic sequence in the same way as in the seed text generator:
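The fusion step is sketched below as an ABINet-style gated sum over the two aligned T-length sequences; the gating mechanism is an assumption about the exact fusion used:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuses aligned visual and semantic sequences into final character logits."""
    def __init__(self, d_model=512, n_classes=37):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, vis_seq, sem_seq):               # both (B, T, D)
        g = torch.sigmoid(self.gate(torch.cat([vis_seq, sem_seq], dim=-1)))
        fused = g * vis_seq + (1 - g) * sem_seq        # per-element gate between modalities
        return self.classifier(fused)                  # (B, T, n_classes)
```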
Training objective:
where M is the number of iterations.
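The equation itself is not reproduced here; as a sketch, an ABINet-style objective with cross-entropy losses on the visual ($\mathcal{L}_v$), semantic ($\mathcal{L}_s$), and fused ($\mathcal{L}_f$) predictions, averaged over the $M$ iterations, could be written as:

```latex
\mathcal{L} \;=\; \lambda_v \,\mathcal{L}_v
\;+\; \frac{\lambda_s}{M}\sum_{i=1}^{M}\mathcal{L}_s^{(i)}
\;+\; \frac{\lambda_f}{M}\sum_{i=1}^{M}\mathcal{L}_f^{(i)}
```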
Visual Clue Masking
In the same way as VisionLAN, the authors utilize a visual clue masking (VCM) strategy, where the visual-to-semantic attention map is used to mask the specific visual features related to a particular character in the sequence. This forces the model to learn more semantics to compensate for the missing information.
This strategy is used only during training. To reduce the influence of this train/inference discrepancy, the visual features are left unchanged in 10% of the training cases.
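A minimal sketch of this strategy, assuming the visual-to-semantic attention map from the seed text generator is used to pick the spatial positions of one randomly chosen character and zero them out, with masking skipped in 10% of training samples (the thresholding rule is an assumption):

```python
import torch

def visual_clue_masking(vis_feats, attn_map, text_lens, keep_prob=0.1):
    """
    vis_feats: (B, HW, D) visual features
    attn_map:  (B, T, HW) attention from seed text generation
    text_lens: (B,)       number of valid characters per sample
    Randomly picks one character per sample and suppresses the visual
    positions that attend most strongly to it; used only during training.
    """
    B, HW, _ = vis_feats.shape
    masked = vis_feats.clone()
    for b in range(B):
        if torch.rand(1).item() < keep_prob or text_lens[b] == 0:
            continue  # leave features unchanged to reduce the train/test gap
        char_idx = torch.randint(int(text_lens[b]), (1,)).item()
        weights = attn_map[b, char_idx]                # (HW,)
        # Zero out the positions most responsible for this character.
        to_mask = weights > weights.mean() + weights.std()
        masked[b, to_mask] = 0.0
    return masked
```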
Experiment Results
Conclusions
- Even though it is not listed in the Papers with Code benchmark, its results are comparable to S-GTR's performance.
- Seems to show outstanding performance in handling partial occlusions.
- Considerable inference speed: <50 ms per image (the input image size is not specified).
- Recent trends in STR improvements seem to include adding a language model and iterative refinement.