Multi-modal Text Recognition Networks: Interactive Enhancements between Visual and Semantic Features
This paper introduces a novel method, called Multi-modAl Text Recognition Network (MATRN), that enables interactions between visual and semantic features for better recognition performance.
Apr 20, 2022
Introduction
MATRN is a scene text recognition (STR) method that enables interactions between visual and semantic features to improve recognition performance.
The paper's main contributions are claimed to be:
- We explore the combinations of visual and semantic features, identified by the VM and LM, and prove their benefits. To the best of our knowledge, multi-modal feature enhancements with bi-directional fusions are novel components that have never been explored.
- We propose a new STR method, named MATRN, that contains three major components: spatial encoding for semantics, multi-modal feature enhancements, and a visual clue masking strategy, for better combinations of the two modalities. Thanks to the effective contributions of the proposed components, MATRN achieves state-of-the-art performance on seven STR benchmarks.
- We provide empirical analyses that illustrate how our components improve STR performance as well as how MATRN addresses the existing challenges.
The development of STR algorithms can be represented by a few approaches:
- Models like ViTSTR mainly focused on the visual feature processing module, without explicitly modeling and training a language module.
- Later approaches like SRN and ABINet utilize a separate language model and subsequently fuse visual and semantic features for the final sequence prediction.
- Approaches like Bhunia et al. proposed a multi-stage decoder that refers to visual features multiple times to enhance semantic features.
- Finally, VisionLAN proposed a language-aware visual mask that refers to semantic features to enhance the visual features. Given a masked character position in the word, the masking module occludes the corresponding visual feature maps of that character region during training.
- Inspired by Bhunia et al. [3] and VisionLAN [25], the authors explore multiple combinations of multi-modal processes and propose a framework that bidirectionally refines visual and semantic features by referring to each other.
MATRN Architecture
MATRN network topology:
Visual feature extraction consists of:
- ResNet with 45 layers
- Transformer units
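A rough sketch of this backbone is given below, assuming a generic CNN stand-in for the 45-layer ResNet followed by standard Transformer encoder layers (layer counts and dimensions are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class VisualFeatureExtractor(nn.Module):
    """CNN backbone + Transformer encoder producing a grid of visual features."""
    def __init__(self, d_model=512, n_layers=3, n_heads=8):
        super().__init__()
        # Stand-in for the 45-layer ResNet used in the paper; any CNN that
        # outputs a (B, d_model, H', W') feature map works for this sketch.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.BatchNorm2d(d_model), nn.ReLU(),
        )
        encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

    def forward(self, images):                         # images: (B, 3, H, W)
        fmap = self.cnn(images)                        # (B, d_model, H', W')
        b, c, h, w = fmap.shape
        tokens = fmap.flatten(2).transpose(1, 2)       # (B, H'*W', d_model)
        return self.transformer(tokens), (h, w)        # contextualized visual features
```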
In the next step, seed text generation transcribes the visual features into a seed text for the language model:
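A hedged sketch of this step, assuming ABINet-style positional attention over the visual features followed by a linear character classifier (class names and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class SeedTextGenerator(nn.Module):
    """Attends to visual features with learned positional queries and
    predicts one character distribution per text position."""
    def __init__(self, d_model=512, max_len=26, n_classes=37):
        super().__init__()
        self.pos_queries = nn.Parameter(torch.randn(max_len, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, visual_feats):                   # (B, H*W, d_model)
        q = self.pos_queries.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        sem, attn_map = self.attn(q, visual_feats, visual_feats)  # attn_map: (B, T, H*W)
        logits = self.classifier(sem)                  # (B, T, n_classes)
        return logits, attn_map
```

The visual-to-semantic attention map returned here is reused further down, both for spatial encoding and for visual clue masking.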
The language model consists of 4 Transformer decoder blocks and is initialized with the weights pre-trained on WikiText-103, as in ABINet:
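As a simplified sketch (not the exact cloze-style architecture of ABINet's language model), the four decoder blocks can be approximated with standard Transformer decoder layers whose queries are positional embeddings and whose memory is the embedded seed text; in practice the weights would be loaded from ABINet's WikiText-103 pre-training:

```python
import torch
import torch.nn as nn

class LanguageModel(nn.Module):
    """Four Transformer decoder blocks refining the seed text into semantic features."""
    def __init__(self, d_model=512, n_classes=37, max_len=26, n_layers=4, n_heads=8):
        super().__init__()
        self.char_embed = nn.Linear(n_classes, d_model)   # embeds seed-text probabilities
        self.pos_embed = nn.Parameter(torch.randn(max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, seed_probs):                     # softmaxed seed logits, (B, T, n_classes)
        memory = self.char_embed(seed_probs)           # (B, T, d_model)
        q = self.pos_embed.unsqueeze(0).expand(memory.size(0), -1, -1)
        return self.decoder(q, memory)                 # semantic features, (B, T, d_model)
```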
The language model output is then fused with Spatial Encoding to Semantics (SES):
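One plausible reading of SES, sketched below under that assumption, is that each character's semantic feature is tagged with an attention-weighted average of the 2D positional encodings of the visual locations it came from; the exact formulation in the paper may differ:

```python
import torch

def spatial_encoding_to_semantics(sem_feats, attn_map, pos_encoding):
    """
    sem_feats:    (B, T, D)  semantic features from the language model
    attn_map:     (B, T, HW) visual-to-semantic attention from seed text generation
    pos_encoding: (HW, D)    2D positional encoding of the visual feature grid
    Each character's semantic feature is augmented with an attention-weighted
    average of the spatial positions it was read from.
    """
    spatial_hint = torch.einsum("bts,sd->btd", attn_map, pos_encoding)
    return sem_feats + spatial_hint
```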
After the visual features (with visual clue masking applied) and the enhanced semantic features are processed separately, they are sent to the multi-modal feature enhancement module:
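A minimal sketch of bidirectional enhancement, assuming plain cross-attention in both directions (visual features attend to semantic features and vice versa); the paper's module may differ in detail:

```python
import torch
import torch.nn as nn

class MultiModalFeatureEnhancement(nn.Module):
    """Bidirectional cross-attention: each modality is refined using the other."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.vis_from_sem = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sem_from_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(d_model)
        self.norm_s = nn.LayerNorm(d_model)

    def forward(self, vis, sem):                       # vis: (B, HW, D), sem: (B, T, D)
        v_upd, _ = self.vis_from_sem(vis, sem, sem)    # visual queries, semantic keys/values
        s_upd, _ = self.sem_from_vis(sem, vis, vis)    # semantic queries, visual keys/values
        return self.norm_v(vis + v_upd), self.norm_s(sem + s_upd)
```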
At the last stage, the enhanced visual and semantic features are fused into the final output sequence. Here, the visual features are first transcribed into a semantic sequence in the same way as in the seed text generator:
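The fusion step is sketched below as an ABINet-style gated sum over the two aligned T-length sequences; the gating mechanism is an assumption about the exact fusion used:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuses aligned visual and semantic sequences into final character logits."""
    def __init__(self, d_model=512, n_classes=37):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, vis_seq, sem_seq):               # both (B, T, D)
        g = torch.sigmoid(self.gate(torch.cat([vis_seq, sem_seq], dim=-1)))
        fused = g * vis_seq + (1 - g) * sem_seq        # per-element gate between modalities
        return self.classifier(fused)                  # (B, T, n_classes)
```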
Training objective:
where M is the number of iterations.
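The equation itself is not reproduced here; as a sketch, an ABINet-style objective with cross-entropy losses on the visual ($\mathcal{L}_v$), semantic ($\mathcal{L}_s$), and fused ($\mathcal{L}_f$) predictions, averaged over the $M$ iterations, could be written as:

```latex
\mathcal{L} \;=\; \lambda_v \,\mathcal{L}_v
\;+\; \frac{\lambda_s}{M}\sum_{i=1}^{M}\mathcal{L}_s^{(i)}
\;+\; \frac{\lambda_f}{M}\sum_{i=1}^{M}\mathcal{L}_f^{(i)}
```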
Visual Clue Masking
In the same way as VisionLAN, the authors utilize a visual clue masking (VCM) strategy, where the visual-to-semantic attention map is used to mask the specific visual features related to a particular character in the sequence. This forces the model to learn more semantics to compensate for the missing information.
This strategy is used only during training. To reduce the influence of this train/inference discrepancy, the visual features are left unchanged in 10% of the training cases.
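A minimal sketch of this strategy, assuming the visual-to-semantic attention map from the seed text generator is used to pick the spatial positions of one randomly chosen character and zero them out, with masking skipped in 10% of training samples (the thresholding rule is an assumption):

```python
import torch

def visual_clue_masking(vis_feats, attn_map, text_lens, keep_prob=0.1):
    """
    vis_feats: (B, HW, D) visual features
    attn_map:  (B, T, HW) attention from seed text generation
    text_lens: (B,)       number of valid characters per sample
    Randomly picks one character per sample and suppresses the visual
    positions that attend most strongly to it; used only during training.
    """
    B, HW, _ = vis_feats.shape
    masked = vis_feats.clone()
    for b in range(B):
        if torch.rand(1).item() < keep_prob or text_lens[b] == 0:
            continue  # leave features unchanged to reduce the train/test gap
        char_idx = torch.randint(int(text_lens[b]), (1,)).item()
        weights = attn_map[b, char_idx]                # (HW,)
        # Zero out the positions most responsible for this character.
        to_mask = weights > weights.mean() + weights.std()
        masked[b, to_mask] = 0.0
    return masked
```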
Experiment Results
Conclusions
- Even though it is not listed in the Papers with Code benchmark, its results are comparable to S-GTR's performance.
- Seems to show outstanding performance in handling partial occlusions.
- Considerable inference speed: <50 ms per image (the input image size is not specified).
- Recent trends in STR improvements seem to include adding a language model and iterative refinement.