AON: Towards Arbitrarily-Oriented Text Recognition

May 10, 2019

Contents

Introduction Proposed Method Experiment Conclusions

Most text recognition papers assume regular text that is read from left to right. This paper addresses the growing amount of scene text data whose orientation is not fixed.

Introduction

Previous approaches

Approaches to irregular text so far fall into the following two groups.

1) Attention-based spatial transformer network (STN)

Rectifies irregular text.

Hard to optimize without human-labeled geometric ground truth.

e.g. a thin-plate-spline (TPS)-based STN needs an initialization pattern to be given.

2) Adding an auxiliary dense character detection task

Learning to Read Irregular Text with Attention Mechanisms

Performs multi-task learning using character-level bounding box annotations.

Show, Attend and Tell suggests that attention has the potential to select from 2D features, but feeding irregular text images directly into such a model did not train well.

The approaches above assume left-to-right reading by default and recognize text in the following two steps.

1) Encode the text image into a 1D sequence.

2) Decode the sequence into characters.

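The problem to solve

In real cases, the reading direction of an image is not always left to right, as the examples below show.

To handle these directions, the paper proposes encoding the input image into the following four feature sequences (sequences whose width or height is 1).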

left → right: horizontal features

right → left: reversed horizontal features

top → bottom: vertical features

bottom → top: reversed vertical features

The four resulting features are multiplied by direction weights before use.

The weights are called character placement clues and are denoted c1, c2, c3, c4.

The filter gate (FG) then produces a single feature sequence with an integrated visual representation.

Combining the four-direction feature extraction network and the clue extraction network in this way gives the proposed arbitrary orientation network (AON).

Contribution

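Proposes the arbitrary orientation network (AON) to extract features in four directions together with character placement information.

Proposes the filter gate (FG) to fuse the four directional features with the learned placement information.

Integrates AON, FG, and an attention-based decoder so that the model can be trained end-to-end without character-level annotations.

Trains and evaluates on irregular and regular text benchmarks, achieving state-of-the-art results on the irregular benchmarks and comparable performance on the regular ones.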

Proposed Method

The Framework

1) Basal Convolutional Neural Network (BCNN)

Convolutional layers that extract image features.

2) Multi-Direction Feature Extraction Module

The module that contains AON and FG.

AON extracts arbitrarily-oriented text features and character placement information.

FG combines the multi-direction features using the character placement information.

3) Attention-based Decoder

The decoder uses an RNN to generate the target sequence.

y_t=softmax(W^Ts_t)

W: learnable parameter (W^T is its transpose)

s_t: hidden state of the RNN at time t

s_t =RNN(y_{t-1}, g_t, s_{t-1})

g_t is the weighted sum of the sequential feature vectors \hat{h}_j.

g_t=\sum_{j=1}^{L}\alpha_{t,j}\hat{h}_j

α_t is the vector of attention weights, computed from the feature vectors and the hidden state.

\alpha_t=Attend(s_{t-1}, \hat{\mathcal{H}})
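The attention step above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the post does not define `Attend`, so the sketch assumes a simple dot-product score between the previous hidden state and each feature vector, and the function names `softmax`, `attend`, and `glimpse` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(prev_state, features):
    """Return attention weights alpha_t over the L feature vectors.

    Assumption: dot-product scoring stands in for the unspecified Attend.
    """
    scores = [sum(s * x for s, x in zip(prev_state, h)) for h in features]
    return softmax(scores)


def glimpse(alpha, features):
    """g_t = sum_j alpha_{t,j} * h_j, computed per feature dimension."""
    dim = len(features[0])
    return [sum(a * h[d] for a, h in zip(alpha, features)) for d in range(dim)]


# Two feature vectors and a zero hidden state give uniform attention.
features = [[1.0, 0.0], [0.0, 1.0]]
alpha = attend([0.0, 0.0], features)
g_t = glimpse(alpha, features)
```

With a zero hidden state every score is equal, so the glimpse is the plain average of the feature vectors.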

Technical Details of AON and FG

1) Arbitrary Orientation Network (AON)

AON consists of three parts: the horizontal network (HN, yellow on the left), the vertical network (VN, yellow in the middle), and the character placement clue network (CN, green on the right).

H: horizontal feature

V: vertical feature

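The H and V feature maps are each reversed to obtain the left→right and right→left features and the top→bottom and bottom→top features.

If HN and VN are trained simultaneously as completely separate networks, training does not go well because orientations are unevenly distributed in the training set. To address this, HN and VN share convolution layers.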
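The reversal step described above can be sketched as follows. The function name `four_direction_sequences` and the dictionary keys are illustrative, and the sketch treats each feature sequence as a plain Python list rather than a tensor.

```python
def four_direction_sequences(horizontal, vertical):
    """Build the four directional sequences from H and V.

    horizontal: feature sequence ordered left to right.
    vertical: feature sequence ordered top to bottom.
    The reversed directions are simply the same sequences read backwards.
    """
    return {
        "left_to_right": list(horizontal),
        "right_to_left": list(reversed(horizontal)),
        "top_to_bottom": list(vertical),
        "bottom_to_top": list(reversed(vertical)),
    }


# Toy example: integers stand in for feature vectors.
seqs = four_direction_sequences([1, 2, 3], [4, 5, 6])
```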

2) Filter Gate (FG)

A filter gate is introduced so that irrelevant features are ignored.
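A rough sketch of how such a gate could work is shown below. The post does not give the gate's formula, so this assumes a clue-weighted sum of the four directional sequences followed by tanh; `filter_gate` and its argument names are hypothetical.

```python
import math


def filter_gate(features, clues):
    """Fuse four directional feature sequences into one.

    features: four sequences, each a list of L vectors of length D.
    clues: four weights (c1..c4), one per direction.
    Assumption: weighted sum over directions, then tanh.
    """
    length = len(features[0])
    dim = len(features[0][0])
    fused = []
    for i in range(length):
        vec = []
        for d in range(dim):
            total = sum(c * seq[i][d] for c, seq in zip(clues, features))
            vec.append(math.tanh(total))
        fused.append(vec)
    return fused


# Four sequences with L = 1 and D = 1. A one-hot clue keeps one direction only.
features = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
only_first = filter_gate(features, [1.0, 0.0, 0.0, 0.0])
```

A clue close to zero removes its direction from the sum, which is the sense in which irrelevant features are ignored.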

Experiment

Datasets

SVT-Perspective: collected from side-view angles in Google Street View, so perspective distortion is common.

CUTE80: a dataset consisting of curved text only.

ICDAR2015

IIIT5K-Words

Street View Text

ICDAR2003

Implementation details

All images are resized to 100x100.

All convolution layers have 3x3 kernels, 1x1 padding, and 1x1 stride.

All (max) pooling blocks have 2x2 kernels.

Each convolution layer is followed by batch normalization (BN) and a ReLU activation.

The LSTM used for text decoding has 256 hidden units and 37 output units (26 letters, 10 digits, 1 EOS).

The model was first trained on cropped text from synthetic data, with the images rotated randomly between 0 and 360 degrees.
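The arithmetic behind these settings can be checked with a short sketch. The convolution parameters and the 37-class output come from the list above; the pooling stride of 2 with no padding, the `<EOS>` token spelling, and the helper names are assumptions made for illustration.

```python
import string


def conv_out(size, kernel=3, pad=1, stride=1):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Output size of max pooling; stride 2 and no padding are assumed."""
    return (size - kernel) // stride + 1


# 26 lowercase letters + 10 digits + 1 end-of-sequence token.
CHARSET = list(string.ascii_lowercase + string.digits) + ["<EOS>"]

after_conv = conv_out(100)         # a 3x3, pad 1, stride 1 conv keeps the size
after_pool = pool_out(after_conv)  # a 2x2 pool with stride 2 halves it
```

Under these assumptions a 100x100 input stays 100x100 after a convolution and becomes 50x50 after one pooling block.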

Performance on Irregular Datasets

FAN requires character-level bounding boxes to improve text recognition performance. Yang et al. also require character-level bounding boxes, so AON, which is trained without them, is valuable.

Of the three baselines below, Naive_base uses only the horizontal network, as in a typical text recognition model, and STN_base adds a TPS to it. Apart from the first and third examples below, the other four cases were not rectified as intended.

Performance on Regular Datasets

AON was designed to recognize both irregular and regular text, so it was also evaluated on regular text benchmarks.

Deep insight into AON

1) The roles of HN, VN, CN and FG in AON.

The placement clues for the four directions show which way the text runs. In the figure below, the grayer a clue appears, the more strongly it is activated. This makes the effect of CN visible.

2) Text placement trends generated with AON

The alignment factors α_t produced by the attention module, which are used to compute the glimpse vector g_t, give the probability distribution over the input sequence.

\mathcal{C}=[c_1, c_2,c_3,c_4]

The image is divided into LxL patches, and the character position distribution dis is computed from the character placement clues C and α_t.

dis = \mathcal{C}\odot\alpha_t

Plotting the character positions computed this way, with an arrow from each character to the next one, gives the figure below.

Discussion

The necessity of CN in AON

To check whether CN is necessary, two experiments were run without it.

a) The horizontal and vertical features were concatenated along the channel dimension.

→ The model converged slowly and could not reach state-of-the-art performance.

b) The horizontal and vertical features were concatenated along the temporal dimension.

→ Results were about 4% lower on all benchmarks.
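The two concatenation variants above differ only in which axis grows. A minimal sketch with nested lists, where the function names are illustrative and each sequence is a list of L vectors of length D:

```python
def concat_channel(h_seq, v_seq):
    """Channel-wise: L steps, each vector grows from D to 2D."""
    return [h + v for h, v in zip(h_seq, v_seq)]


def concat_temporal(h_seq, v_seq):
    """Temporal: the sequence grows from L to 2L steps, vectors stay D wide."""
    return h_seq + v_seq


h_seq = [[1, 2], [3, 4]]  # L = 2, D = 2
v_seq = [[5, 6], [7, 8]]

by_channel = concat_channel(h_seq, v_seq)  # 2 steps of width 4
by_time = concat_temporal(h_seq, v_seq)    # 4 steps of width 2
```

Neither variant uses placement clues, so the decoder has to work out on its own which part of the concatenated input is relevant.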

Impact of aspect ratio

Experiments on the effect of aspect ratio showed no difference. Scaling the height up or down had little effect on the recognition rate of long horizontal text.

Integrating with only two directional feature sequence

When only two directions are used and the text is flipped, the visual features of characters such as 'p' and 'b' get mixed by the filter gate.

Conclusions

The paper designs an arbitrary orientation network that extracts features in four directions together with character placement clues. A filter gate combines the four directional sequences. Finally, an attention-based decoder turns the combined sequence into characters.