SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition

Oct 29, 2021

Contents

Introduction Proposed Method Experiment Conclusions

This is a CVPR 2020 paper. I came across it after implementing an OCR recognizer, while looking into which feature to add next.

Introduction

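In scene text recognition, perspective distortion and curved text shapes have largely been handled by encoder-decoder architectures. Images degraded by lighting, blur, or incomplete characters, however, still leave a lot to be solved.

To recognize low-quality scene text robustly, the paper proposes a semantics enhanced encoder-decoder framework. The base architecture is ASTER, into which the proposed method is integrated.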

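The main contributions claimed by the paper are as follows.

Proposes the SEED (semantics enhanced encoder-decoder) framework

Shows, by combining it with ASTER, that the framework can be integrated into existing architectures

Achieves state-of-the-art results on several benchmarks, with a particularly large margin on the low-quality datasets ICDAR2015 and SVT-Perspective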

Proposed Method

1. Encoder-Decoder Framework

1) Plain Encoder-Decoder Framework

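The encoder extracts a context vector C that carries the global information of the input, and the decoder converts this context vector into the desired target output. The drawback is that a single fixed-size context vector has limited capacity to represent the whole input.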

2) Attention-Based Encoder-Decoder Framework

The attention mechanism lets the decoder select the appropriate context at each decoding step, which alleviates the long-range dependency problem. The alignment between encoder and decoder is learned in a weakly supervised manner.
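To make "selecting the appropriate context at each decoding step" concrete, here is a minimal sketch in plain Python. It assumes dot-product scoring for brevity, which is a simplification and not the additive (Bahdanau) scoring used later in SE-ASTER; the function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(decoder_state, encoder_features):
    """Return attention weights and the weighted context vector.

    Scoring is a plain dot product (an assumption made for brevity).
    """
    scores = [
        sum(d * h for d, h in zip(decoder_state, feature))
        for feature in encoder_features
    ]
    weights = softmax(scores)
    dim = len(encoder_features[0])
    context = [
        sum(w * feature[j] for w, feature in zip(weights, encoder_features))
        for j in range(dim)
    ]
    return weights, context


# Three encoder positions with 2-dimensional features (toy values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_context([2.0, 0.0], features)
```

The decoder state changes at every step, so the weights and the resulting context change with it; positions whose features align with the current state receive more weight.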

3) Proposed Encoder-Decoder Framework

In scene text recognition, both of the approaches above leave the decoder relying only on local visual features. Because recognition proceeds without any global feature, they do not work well on low-quality images. In the proposed framework, the encoder learns an explicit global semantic feature and the decoder makes use of it. The global semantic feature used as the training target is the word embedding produced by FastText.

2. FastText model

FastText, which is based on skip-gram, was chosen as the pre-trained language model.

*skip-gram: word2vec replaces one-hot encoding with dense vectors, so that the meaning of a word is represented as a point in a vector space. This is done with either a CBOW or a skip-gram model.

The two variants are continuous bag-of-words (CBOW) and skip-gram. CBOW aims to predict the word at the center of a given context from the word vectors of that context. Skip-gram, conversely, predicts the probability distribution of the context words from the center word. http://solarisailab.com/archives/959

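Let $T = \{w_{i-l}, \dots, w_{i+l}\}$ be a sentence in a text corpus. $l$ is a hyper-parameter that sets the window size (the paper describes it as the length of the sentence, but it actually determines the window). When a word $w_i$ is represented by an embedding vector $v_i$, skip-gram is trained so that a neural network given this word as input predicts its context $C_i = \{w_{i-l}, \dots, w_{i-1}, w_{i+1}, \dots, w_{i+l}\}$. The embedding vector is optimized through this training.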
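As a rough illustration of what skip-gram is trained on, the sketch below enumerates (center word, context word) pairs for a given window size. The function name and the example sentence are hypothetical, and the window is simply truncated at sentence boundaries, which is an assumption.

```python
def skipgram_pairs(tokens, window):
    """List every (center, context) pair within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["the", "quick", "brown", "fox"], window=1)
for center, context in pairs:
    print(center, "->", context)
```

Each pair is one training example: the network receives the center word and is asked to assign high probability to the context word.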

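FastText embeds subwords. The subwords of a word are determined by two hyper-parameters, $l_{min}$ and $l_{max}$, the minimum and maximum subword length. For the word where with $l_{min} = 2$ and $l_{max} = 4$, the subwords are {wh, he, er, re, whe, her, ere, wher, here}. The representation of a word is built by combining the embedding vectors of these subwords, which avoids the out-of-vocabulary problem.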
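The subword set in the example can be reproduced with a few lines of Python. This is a sketch of the enumeration only, with hypothetical names; the actual FastText library also wraps each word in boundary markers before extracting n-grams, which is left out here so the output matches the example in the text.

```python
def subwords(word, l_min, l_max):
    """Return all character n-grams of `word` with l_min <= n <= l_max."""
    grams = []
    for n in range(l_min, l_max + 1):
        for start in range(len(word) - n + 1):
            grams.append(word[start:start + n])
    return grams


print(subwords("where", 2, 4))
```

Because a word that never appeared in training still shares n-grams with known words, its embedding can be assembled from the embeddings of those n-grams.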

3. SEED

1) General Framework

The framework that uses the semantic feature described above is shown in the figure below. It differs from the attention-based encoder-decoder in that the proposed semantic module predicts extra semantic information. Supervising this prediction with a pre-trained language model improves performance: recognition becomes more robust on low-quality images and recognition mistakes are reduced.

2) Architecture of Semantics Enhanced ASTER

ASTER is used as the example of applying the proposed framework, and the resulting model is called SE-ASTER. SE-ASTER consists of four modules.

Rectification module: applies a thin-plate spline (TPS) transformation by predicting its control points.

Encoder: consists of ResNet-45 followed by 2 BiLSTM layers (256 hidden units). Its output is a feature sequence $h = (h_1, \dots, h_L)$ of shape $L \times C$, where L is the width of the feature map and C is its depth.

Semantic module: the feature sequence h is flattened and used as the input $I$. The semantic information S is obtained with the two linear functions below. The activation function $\sigma$ is ReLU, and $W_1, W_2$ and $b_1, b_2$ are trainable parameters.

$$S = W_2 \, \sigma(W_1 I + b_1) + b_2$$

Decoder: a Bahdanau attention mechanism built on a single GRU layer. Unlike the original ASTER, only a single-direction decoder is used (the original ASTER trains both a left-to-right and a right-to-left decoder). The semantic information $S$ is used as the initial state of the GRU. Instead of initializing with a zero state, this lets the global semantics guide the decoding process.
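The semantic module can be sketched in plain Python as two linear maps with a ReLU in between. The helper names and the toy dimensions are assumptions chosen for readability; the paper's implementation is in PyTorch and uses much larger sizes.

```python
def linear(weights, bias, x):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def relu(x):
    return [max(0.0, v) for v in x]


def semantic_module(h, W1, b1, W2, b2):
    """Flatten the L x C feature sequence and apply S = W2 relu(W1 I + b1) + b2."""
    flat = [v for step in h for v in step]  # I, of length K = L * C
    return linear(W2, b2, relu(linear(W1, b1, flat)))


# Toy sizes (assumptions): L = 3 steps, C = 2 channels, 4 hidden units, 5 outputs.
L, C, HIDDEN, OUT = 3, 2, 4, 5
h = [[0.5, -1.0], [1.0, 0.0], [-0.5, 2.0]]
W1 = [[0.1 * (r - c) for c in range(L * C)] for r in range(HIDDEN)]
b1 = [0.0] * HIDDEN
W2 = [[0.05 * (r + c) for c in range(HIDDEN)] for r in range(OUT)]
b2 = [0.1] * OUT

S = semantic_module(h, W1, b1, W2, b2)
# In SE-ASTER, S would then serve as the GRU's initial hidden state
# in place of a zero vector.
```

Since the flattened input has length L × C, the first weight matrix depends on the feature-map width, which is one reason the input image size is fixed.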

4. Loss function and training strategy

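Supervision is applied to both the semantic module and the decoder module, so the loss function is as follows.

$$L = L_{rec} + \lambda L_{sem}$$

$L_{rec}$ is the cross entropy loss computed from the predicted probabilities and the ground truth.

$L_{sem}$ is a cosine embedding loss applied to the semantic information.

The balancing hyper-parameter $\lambda$ is set to 1.

$$L_{sem} = 1 - \cos(S, em)$$

$S$ is the predicted semantic information and $em$ is the word embedding obtained from the pre-trained FastText model.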
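A small numeric sketch of this combined loss follows. The function names are hypothetical, and averaging the cross entropy over decoding steps is an assumption; the text only states that it is a cross entropy loss.

```python
import math


def cross_entropy(step_probs, target_ids):
    """Mean negative log-likelihood of the target symbol at each step."""
    losses = [-math.log(probs[t]) for probs, t in zip(step_probs, target_ids)]
    return sum(losses) / len(losses)


def cosine_embedding_loss(s, em):
    """1 - cos(s, em): zero when the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(s, em))
    norm_s = math.sqrt(sum(a * a for a in s))
    norm_em = math.sqrt(sum(b * b for b in em))
    return 1.0 - dot / (norm_s * norm_em)


def seed_loss(step_probs, target_ids, s, em, lam=1.0):
    """L = L_rec + lam * L_sem, with lam = 1 as in the text."""
    return cross_entropy(step_probs, target_ids) + lam * cosine_embedding_loss(s, em)


# Two decoding steps over a 3-symbol alphabet (toy values).
step_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
target_ids = [0, 1]
loss = seed_loss(step_probs, target_ids, s=[1.0, 0.0], em=[1.0, 1.0])
```

The cosine term only constrains the direction of the predicted vector, not its magnitude, so the prediction is pulled toward the FastText embedding regardless of scale.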

There are two training strategies.

Initialize the decoder state (at the start token) with the embedding extracted from FastText

Use the predicted semantic information

Both were tried and gave similar performance, so the second strategy was chosen.

Experiment

1. Datasets

IIIT5K-Words

SVT (Street View Text)

ICDAR2013

ICDAR2015

CUTE80(CUTE)

Synth90K

SynthText

2. Implementation details

The proposed SE-ASTER is implemented in PyTorch, and the pre-trained FastText model is the officially available one. The model predicts 97 symbols. The input image size is 64x256. ADADELTA is used as the optimizer, with no pre-training and no data augmentation. The learning rate starts at 1 and decays to 0.1 (at epoch 4) and 0.01 (at epoch 5). The model is trained on SynthText and Synth90K for 6 epochs in total.
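The step schedule can be written as a small function. Whether epochs are counted from 0 or 1 is not stated in the text, so the 0-indexed convention below is an assumption, as is the function name.

```python
def learning_rate(epoch):
    """Step decay: 1.0, then 0.1 from epoch 4, then 0.01 from epoch 5.

    `epoch` is assumed to be 0-indexed, giving 6 epochs numbered 0..5.
    """
    if epoch < 4:
        return 1.0
    if epoch < 5:
        return 0.1
    return 0.01


schedule = [learning_rate(epoch) for epoch in range(6)]
print(schedule)
```

Under this reading the first four epochs run at the initial rate and the last two epochs each use a rate ten times smaller than the one before.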

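At test time, images are resized to the same size as in training, and beam search is used for GRU decoding. Beam search keeps k candidates at each step and finally picks the one with the highest accumulated probability. In these experiments k is set to 5.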
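The following is a minimal beam search sketch over a toy next-symbol table, not the paper's decoder. The table `TOY`, the `<eos>` symbol, and the function names are all hypothetical; scores are summed log-probabilities without length normalization, which is also an assumption.

```python
import math

# Toy next-symbol distributions keyed by the prefix decoded so far.
TOY = {
    (): [("a", 0.6), ("b", 0.4)],
    ("a",): [("a", 0.4), ("b", 0.3), ("<eos>", 0.3)],
    ("b",): [("<eos>", 0.9), ("a", 0.1)],
}


def toy_step(prefix):
    """Return (symbol, probability, next_prefix) triples for a prefix."""
    options = TOY.get(prefix, [("<eos>", 1.0)])
    return [(sym, p, prefix + (sym,)) for sym, p in options]


def beam_search(step_fn, start_state, eos, k=5, max_len=10):
    """Keep the k best partial sequences per step; return the best overall."""
    beams = [(0.0, [], start_state)]  # (log-probability, symbols, state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, symbols, state in beams:
            for sym, prob, next_state in step_fn(state):
                candidates.append((logp + math.log(prob), symbols + [sym], next_state))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:k]:
            # Sequences that emitted the end symbol leave the beam.
            (finished if cand[1][-1] == eos else beams).append(cand)
        if not beams:
            break
    best = max(finished or beams, key=lambda c: c[0])
    return best[1], best[0]


symbols, logp = beam_search(toy_step, (), "<eos>", k=5)
```

In this toy table, greedy decoding would start with "a" because it is the most likely first symbol, while beam search also keeps "b" alive and ends up with the sequence "b" followed by the end symbol, which has a higher total probability.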

3. Ablation study

1) Decoder state initialization

WES: word embedding supervision, where only a subtask that predicts the word embedding is attached.

INIT: the GRU initial state is set to the predicted semantic information.

4. Performance with inaccurate bounding boxes

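This experiment checks whether the method helps end-to-end performance, where the recognizer receives imperfect boxes from a detector. Images from ICDAR13 and ICDAR15 were randomly cropped by up to 15% on each of the top, bottom, left, and right sides, and recognition accuracy was measured on the result. Removing at most 15% from each side leaves at least 70% of the width and of the height, so each cropped image has an IoU of at least 0.49 (0.7 × 0.7) with the original. This is close to the IoU threshold of 0.5 used in detection evaluation protocols.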
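The 0.49 bound can be checked with a short simulation. This is a sketch under stated assumptions: boxes are axis-aligned `(x0, y0, x1, y1)` tuples, each side is cropped by an independent uniform ratio, and the function names are hypothetical.

```python
import random


def random_shrink(box, rng, max_ratio=0.15):
    """Crop up to `max_ratio` of the width/height from each of the four sides."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    return (
        x0 + rng.uniform(0, max_ratio) * w,
        y0 + rng.uniform(0, max_ratio) * h,
        x1 - rng.uniform(0, max_ratio) * w,
        y1 - rng.uniform(0, max_ratio) * h,
    )


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


rng = random.Random(0)
original = (0.0, 0.0, 256.0, 64.0)
lowest = min(iou(original, random_shrink(original, rng)) for _ in range(1000))
print(f"lowest IoU over 1000 crops: {lowest:.3f}")
```

Because the cropped box always lies inside the original, the IoU equals the ratio of the two areas, which is why the worst case is exactly 0.7 × 0.7.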