NRTR: A No-Recurrence Sequence-to-Sequence Model For Scene Text Recognition

Mar 03, 2021

Contents

Introduction Methodology Experiments Conclusion

A Simple and Strong Convolutional-Attention Network for Scene Text Recognition

Existing methods are built on recurrence or convolution, which brings two drawbacks: RNNs train slowly, and capturing long-term features requires a complex convolutional network. To address this, the paper introduces NRTR, a no-recurrence sequence-to-sequence text recognizer. NRTR uses an encoder-decoder structure: the encoder extracts image features with self-attention, and the decoder recognizes text from the encoder's output. Because scene images vary widely in both text and background, a modality-transform block is added to convert the 2D input image into a 1D sequence efficiently. NRTR achieves state-of-the-art results on both regular and irregular text, and it trains 8 times faster than existing methods.

Introduction

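Most recent work on text recognition follows the sequence-to-sequence (seq2seq) paradigm. These methods fall broadly into two branches, CNN-based and RNN-based. RNNs, however, cannot be computed in parallel by nature, and training them suffers from problems such as vanishing and exploding gradients.

To speed up sequential computation, some methods replace the RNN with a CNN-based recognizer. But a CNN has difficulty learning relationships between distant positions, and stacking more layers to compensate makes the network more complex, which creates a dilemma between complexity and performance.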

Reading Scene Text with Attention Convolutional Sequence Modeling

Scene Text Recognition with Sliding Convolutional Character Models

NRTR takes its inspiration from the Transformer and is built on the self-attention mechanism. In the encoder-decoder structure, the encoder uses self-attention to transform the input image sequence into a feature representation. The decoder applies self-attention to convert the encoder's output into a character sequence.

Unlike 'Attention Is All You Need', which takes a 1D sequence as input, a scene text recognizer receives a 2D image. A modality-transform block is applied as pre-processing before the encoder to convert the 2D image into a sequence efficiently.

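Applying a self-attention architecture increases parallel computation and reduces complexity.

Adding the modality-transform block allows the relevant sequence to be extracted more effectively.

NRTR achieves state-of-the-art results on a range of benchmarks and trains quickly.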

Methodology

NRTR consists of three sub-networks: the modality-transform block, the encoder, and the decoder.

Self-Attention Mechanism

Self-attention extracts correlation information between positions of the input and output. As in 'Attention Is All You Need', NRTR uses Scaled Dot-Product Attention. The attention operation takes a query, keys, and values as input. A dot product is computed between the query and every key to obtain their similarity, and a softmax turns the scaled similarities into weights over the values. Given a query q, all keys (packed into a matrix K), and all values (packed into a matrix V), the output value is a weighted average of the input values.

v^{out} = \text{softmax}(\frac{qK^t}{\sqrt{d_k}})V

\text{Attention Score}: qK^t
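The formula above can be sketched for a single query in plain Python. This is a minimal illustration rather than the authors' implementation; the function names and the toy vectors are assumptions made for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, K, V):
    """Attention output for a single query vector q.

    q: list of d_k floats; K: list of key vectors (d_k floats each);
    V: list of value vectors, one per key.
    """
    d_k = len(q)
    # Attention scores: dot product of q with every key, scaled by sqrt(d_k).
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
    weights = softmax(scores)
    # Output: weighted average of the value vectors.
    d_v = len(V[0])
    out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)]
    return out, weights

# Toy example (hypothetical values): the query matches the first key best.
q = [1.0, 0.0]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
v_out, weights = scaled_dot_product_attention(q, K, V)
```

Because the query is closer to the first key, the first value receives the larger weight and dominates the output.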

Modality-Transform Block

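The modality-transform block consists of CNN layers; each layer uses a stride of 2, and the number of channels doubles from layer to layer. d_model (the dimension of the encoder-decoder model) is held fixed and set equal to the product of the feature height and the number of channels. The final concatenate operation reshapes the features into an input sequence of shape (input_step, dim) = (w0/2^n, d_model).

⇒ Each input step is made up of the features along the height and channel directions. The output of the last (n-th) CNN layer must therefore have d_model/(h0/2^n) channels.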
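The shape arithmetic can be checked with a short sketch. It assumes each stride-2 layer halves the height and width exactly (padding details are ignored). The width of 100 is a hypothetical value; h0 = 32, n = 2, and d_model = 512 are the settings mentioned elsewhere in this post.

```python
def modality_transform_shape(h0, w0, n_layers, d_model):
    """Return (input_steps, height, channels) after n stride-2 CNN layers."""
    factor = 2 ** n_layers
    if h0 % factor or w0 % factor:
        raise ValueError("h0 and w0 are assumed divisible by 2**n_layers")
    height = h0 // factor
    if d_model % height:
        raise ValueError("d_model must be divisible by the final feature height")
    # d_model = height * channels, so the last CNN layer needs this many channels.
    channels = d_model // height
    input_steps = w0 // factor
    return input_steps, height, channels

# 32 x 100 image, two CNN layers, d_model = 512 (the width is a made-up example).
steps, height, channels = modality_transform_shape(h0=32, w0=100, n_layers=2, d_model=512)
```

With these numbers the encoder receives 25 input steps, each a 512-dimensional vector built from 8 rows of 64 channels.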

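In addition, because NRTR has no recurrence, it uses positional encoding to mark each position in the sequence.

An RNN receives its input sequentially and so learns position implicitly, whereas a Transformer receives every element separately, so the position information is lost.

In the positional encoding PE(pos, i), pos is the position in the input image and i is the i-th dimension. PE(pos, i) is added to the encoder's input sequence.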
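A sketch of this step, assuming the standard sinusoidal encoding from 'Attention Is All You Need' (sine on even dimensions, cosine on odd dimensions, base 10000). That choice is an assumption, since the exact indexing NRTR uses is not spelled out here, and the function names are illustrative.

```python
import math

def positional_encoding(num_steps, d_model):
    """Return a num_steps x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(num_steps):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positional_encoding(sequence):
    """Add PE(pos, i) element-wise to a list of d_model-dimensional vectors."""
    pe = positional_encoding(len(sequence), len(sequence[0]))
    return [[x + p for x, p in zip(vec, pe_row)]
            for vec, pe_row in zip(sequence, pe)]

pe = positional_encoding(num_steps=4, d_model=8)
encoded = add_positional_encoding([[0.0] * 8 for _ in range(4)])
```

Adding the table to an all-zero sequence returns the table itself, which makes it easy to see that every position gets a distinct vector.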

Encoder

The encoder consists of N_e identical encoder blocks. Each encoder block has two sub-layers: multi-head scaled dot-product attention and a position-wise fully connected network.

Through scaled dot-product attention, the multi-head mechanism can attend to the information at each position. The number of heads h in multi-head attention can be tuned, much like the number of filters in a convolution. Multi-head scaled dot-product attention consists of the following three operations.

1) Three linear projections of the input sequence produce the queries, keys, and values, on which scaled dot-product attention is computed.

2) The h stacked scaled dot-product attentions are computed in parallel.

3) The outputs are concatenated and passed through a linear layer to produce the final output.

MultiHead(Q, K, V) = Concat(head_1, ...,head_h)W^O \text{ where, } head_i = Attention(QW^Q_i, KW^K_i, VW^V_i)
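The three operations and the MultiHead formula can be sketched in plain Python. The projection matrices below are hand-picked, hypothetical values (each head simply reads two of the four input dimensions); in a trained model they would be learned.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for matrices of queries, keys, and values."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    return matmul([softmax(row) for row in scores], V)

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V hold one projection matrix per head; W_O mixes the heads."""
    heads = [attention(matmul(Q, wq), matmul(K, wk), matmul(V, wv))
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    # Concatenate the head outputs along the feature axis, then project.
    concat = [[x for head in heads for x in head[t]] for t in range(len(Q))]
    return matmul(concat, W_O)

d_model, h = 4, 2
X = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5], [1.0, 1.0, 0.5, 0.5]]
# Hypothetical projections: head 0 reads dims 0-1, head 1 reads dims 2-3.
select = [[[1.0 if r == 2 * i + c else 0.0 for c in range(2)]
           for r in range(d_model)] for i in range(h)]
identity = [[1.0 if r == c else 0.0 for c in range(d_model)] for r in range(d_model)]
out = multi_head(X, X, X, select, select, select, identity)
```

Passing X as the query, key, and value input makes this self-attention: the output has one d_model-dimensional vector per input step.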

The position-wise fully connected network consists of two linear layers.

FFN(x) = max(0, xW_1+b_1)W_2+b_2

For efficient training, layer normalization and a residual connection are added to each sub-layer.

Here the sub-layers are each multi-head attention and the position-wise fully connected network.

LayerNorm(x + Sublayer(x))
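A small sketch of FFN(x) wrapped in LayerNorm(x + Sublayer(x)) for one position vector. The weights are toy values chosen for the example, and the layer normalization omits the learnable gain and bias that practical implementations usually include (an assumption made for brevity).

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one position vector x."""
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(hj * W2[j][k] for j, hj in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]

def residual_sublayer(x, sublayer):
    """LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Toy weights (hypothetical): d_model = 2, d_ff = 3.
W1 = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0]
x = [1.0, 2.0]
ffn_out = ffn(x, W1, b1, W2, b2)
block_out = residual_sublayer(x, lambda v: ffn(v, W1, b1, W2, b2))
```

The same `residual_sublayer` wrapper would apply to the attention sub-layer, since both sub-layers map a d_model-dimensional input to a d_model-dimensional output.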

Decoder

The decoder generates the text sequence from the encoder's output and the input labels. A learnable character-level embedding represents each character of the input label as a d_model-dimensional vector. The embedded character vectors then receive positional encoding and become the decoder's input.

Like the encoder, each decoder block is built from multi-head scaled dot-product attention and a position-wise fully connected network.

There are two differences:

1) Because decoding is auto-regressive, the masked multi-head attention ensures that the prediction for position j can depend only on the known outputs at positions before j.

2) The multi-head attention over the encoder takes its keys and values from the encoder's output, while its queries come from the output of the previous decoder block.
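The masking in 1) can be sketched as a lower-triangular mask applied before the softmax. The sketch assumes the usual shift-right convention, where the decoder input at step j holds the characters produced before step j, so attending to inputs 0..j never looks at future outputs. The helper names are illustrative.

```python
import math

NEG_INF = float("-inf")

def causal_mask(length):
    """mask[j][k] is True when step j may attend to decoder input k (k <= j)."""
    return [[k <= j for k in range(length)] for j in range(length)]

def masked_softmax(scores, mask_row):
    """Softmax that assigns zero weight to masked-out positions."""
    masked = [s if keep else NEG_INF for s, keep in zip(scores, mask_row)]
    m = max(masked)  # finite, because position j itself is always kept
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
# Step 1 sees inputs 0 and 1; input 2 is hidden even though its score is largest.
step1_weights = masked_softmax([1.0, 2.0, 3.0], mask[1])
```

During training this lets all positions be computed in parallel while still matching what the decoder sees step by step at inference time.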

The final output passes through a linear projection and a softmax to become per-character probabilities. (d_model, batch_max_length; the Generator in the current Recognition)

Experiments

Benchmark datasets

IIIT5K

ICDAR 2003(IC03)

ICDAR 2013(IC13)

ICDAR 2015(IC15)

SVT-P

CUTE80

Implementation detail

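NRTR is trained only on Synth90k and evaluated on the benchmark datasets without any further fine-tuning. For both training and inference, the input image height is fixed at 32 and the width is scaled to preserve the aspect ratio. The output has 38 classes: the 26 lowercase letters, the 10 digits, a space, and an end-of-sequence token.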
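The 38-class output can be illustrated with a small label encoder. The token name `<eos>` and the class ordering are assumptions made for this example; only the composition of the class set comes from the text above.

```python
import string

EOS = "<eos>"  # hypothetical name for the end-of-sequence token
# 26 lowercase letters + 10 digits + space + end-of-sequence = 38 classes.
CHARSET = list(string.ascii_lowercase) + list(string.digits) + [" ", EOS]
CHAR_TO_INDEX = {ch: idx for idx, ch in enumerate(CHARSET)}

def encode_label(text):
    """Map a word to class indices and append the end-of-sequence index."""
    return [CHAR_TO_INDEX[ch] for ch in text.lower()] + [CHAR_TO_INDEX[EOS]]

indices = encode_label("ab1")
```

Lowercasing the label before lookup reflects the case-insensitive class set; characters outside the set would raise a `KeyError` in this sketch.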

During training, batches are formed from images of approximately equal width. The Adam optimizer is used with β_1 = 0.9, β_2 = 0.98, and ε = 10^-9.

lrate = d_{model}^{-0.5} \cdot \min(n^{-0.5}, n \cdot warmup\_n^{-1.5})

n: current step

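warmup_n: the number of warm-up steps; during these initial steps the learning rate increases rather than decreases.

The graph is included to help explain the warm-up learning rate; its actual shape differs.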
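The schedule can be written directly from the formula. The default `warmup_n=4000` is a placeholder chosen for illustration and is not a value stated in this post; d_model = 512 matches the setting used in the experiments.

```python
def nrtr_lrate(n, d_model=512, warmup_n=4000):
    """lrate = d_model^-0.5 * min(n^-0.5, n * warmup_n^-1.5) for step n >= 1."""
    if n < 1:
        raise ValueError("step n must be >= 1")
    return d_model ** -0.5 * min(n ** -0.5, n * warmup_n ** -1.5)

# The rate rises linearly until n == warmup_n, then decays as n^-0.5.
early = nrtr_lrate(1)
peak = nrtr_lrate(4000)
late = nrtr_lrate(16000)
```

The two arguments of `min` are equal exactly at n = warmup_n, which is where the learning rate peaks.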

Dropout is set to 0.1 to prevent overfitting.

Training runs for 6 epochs, and the model used for inference is the average of 10 checkpoints.
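Checkpoint averaging can be sketched as an element-wise mean over saved parameters. The dictionary layout (parameter name mapped to a flat list of floats) is an assumption made for illustration; a real framework would store tensors instead.

```python
def average_checkpoints(checkpoints):
    """Average parameters element-wise across checkpoints.

    checkpoints: list of dicts mapping parameter name -> list of floats.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    count = len(checkpoints)
    return {
        name: [sum(vals) / count
               for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }

# Two toy checkpoints with a single parameter named "w" (hypothetical).
averaged = average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
```

For the setup described above, the list would hold the 10 saved checkpoints.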

Ablation study

1) Exploration of the encoder and the decoder

The model uses N_e encoder blocks and N_d decoder blocks, and the inner dimension of the fully connected layers is d_ff. The settings are d_model = 512 and h = 8 (the number of heads), and the modality-transform block uses two CNN layers.

The baseline uses 6 encoder blocks and 6 decoder blocks. With the total number of blocks held constant, more encoder blocks give better performance.

Adding layers to build a stronger model yields higher performance (12 encoder blocks, 6 decoder blocks). The model is not scaled further because of time and cost.

Comparing 8enc4dec with 4enc8dec shows that a deeper encoder extracts a better representation.

Increasing the inner dimension d_ff also improves performance.


# Detection repo
N_d = 4
N_e = 4
d_model=512
h=8
d_ff=2048

2) Exploration of the modality-transform block

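Many structures and architectures were investigated, and the two examples below were selected. As Table 1 shows, more CNN layers reduce performance. (Earlier models such as CRNN and RARE use 7 layers.)

Because the encoder already has sufficient feature extraction ability, two CNN layers are used to produce discriminative features.

CNNLSTM captures temporal information through recurrent connections. It helps the base model but hurts the large model. The likely reason is that the encoder is already large enough that CNNLSTM becomes excessive for extracting image information.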

Comparisons with the State-of-the-arts

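Based on the analysis above, NRTR is configured with 12 encoder blocks, 6 decoder blocks, d_ff = 4096, and a modality-transform block with two CNN layers. Because recent methods train on both Synth90k and SynthText, NRTR is also trained on the two datasets for a fair comparison.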

1) Accuracy

On the regular benchmarks, NRTR outperforms most existing methods in the lexicon-free setting.

On the irregular benchmarks, it performs comparably to models designed specifically for irregular text.

2) Speed

Table 2 lists the training time of each method.

The comparison relies on reported times, and training speeds could not be found for every method.

Training time per epoch / GPU

FLOPS comparison: P40 > M40 ≈ Titan X.

Robust Scene Text Recognition with Automatic Rectification: 16h / Titan X

Edit Probability for Scene Text Recognition: 40h / Tesla M40

NRTR: 5h / Titan X

Inference Speed(per image)

NRTR: 0.03s
Edit Probability for Scene Text Recognition: 0.11s
Robust Scene Text Recognition with Automatic Rectification: 0.2s

3) Visualization

On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention https://arxiv.org/pdf/1910.04396.pdf

MASTER: Multi-Aspect Non-local Network for Scene Text Recognition https://arxiv.org/pdf/1910.02562.pdf

Bidirectional Scene Text Recognition with a Single Decoder https://arxiv.org/pdf/1912.03656.pdf

Bidirectional Scene Text Recognition with a Single Decoder

Added in v2. NRTR: IC15: 79.4, SVT-P: 86.6, CUTE80: 80.9

Conclusion

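Earlier papers had already applied the Transformer to recognition; the difference here is that the encoder is used as a feature extractor. Although the model adopts the Transformer architecture as-is, the way performance changes with how the encoder, the decoder, and the modality-transform block in front are combined should be useful for improving recognition models.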