Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Sep 23, 2019

Contents

Introduction
Proposed Method
Evaluation
Conclusions

While looking for a task that involves understanding a document and finding the value that belongs to a given key, I came across the ICDAR 2019 Robust Reading Challenge on Scene Text Visual Question Answering. The task takes an image and a question as input and produces an answer.

Task 1 - Strongly Contextualised: a word list is given for each image.

Task 2 - Weakly Contextualised: a list of 30,000 words is given.


gt = Hartford
det = yartforc
score = 0.75
question = What is printed on the top line of the hoodie?


gt = Kristian Svensson
det = kristian svensson
score = 1
question = Who holds the copyright?


gt = Pioneer
det = pioneer
score = 1
question = What is the name of the boat?
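
The scores in the three examples above are consistent with a case-insensitive normalized Levenshtein similarity, which is the basis of the ANLS metric used in this challenge. The sketch below is not the official evaluation script: the function names are hypothetical, and the 0.5 cut-off is the ANLS threshold as I understand it, so treat it as an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity_score(gt: str, det: str, threshold: float = 0.5) -> float:
    """1 - normalized edit distance, or 0 if the strings are too different.

    The threshold value is an assumption taken from the ANLS definition.
    """
    gt, det = gt.strip().lower(), det.strip().lower()
    if not gt and not det:
        return 1.0
    nl = levenshtein(gt, det) / max(len(gt), len(det))
    return 1.0 - nl if nl < threshold else 0.0


print(similarity_score("Hartford", "yartforc"))                    # 0.75
print(similarity_score("Kristian Svensson", "kristian svensson"))  # 1.0
```

"Hartford" and "yartforc" differ by two substitutions over eight characters, which gives 1 - 2/8 = 0.75, matching the first example.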

The approach from Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering took first place in the challenge above. Before the paper was published, it had also won first place in the 2017 VQA Challenge, which is not OCR-based.

Published at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. Oral presentation.

Introduction

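Visual attention is widely used for VQA. Depending on the task, people look at an image top-down, guided by context, and their attention is drawn bottom-up when they notice a new cue or something unexpected.

The paper realizes this as attention: attention driven by non-visual, task-specific context is treated as top-down, while the attention mechanism driven purely by visual feed-forward processing is treated as bottom-up.

Conventional mechanisms for visual attention operate on a uniform grid, as on the left of the figure below. The proposed approach instead attends at the level of objects and other salient regions, as on the right. The bottom-up mechanism that enables this is built on Faster R-CNN and yields a feature vector for each salient region.

The top-down mechanism uses task-specific context to compute an attention distribution over these features. The attended feature is then the weighted average of the image features under that distribution.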

Proposed Method

1. Bottom-up attention model

Faster R-CNN with a ResNet-101 backbone is used as the bottom-up attention model. To produce the image features V used for VQA or image captioning, non-maximum suppression (NMS) is applied to the final output, and the regions whose detection confidence exceeds a threshold are selected. For each selected region i, v_i is the mean-pooled convolutional feature from that region. The feature dimension D is 2048.
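
A minimal sketch of the last two steps (confidence filtering and mean pooling), assuming NMS has already been applied and that each surviving region comes with a small convolutional feature map. The function name, the 7x7 spatial size, and the 0.2 threshold are illustrative assumptions, not values taken from this post.

```python
import numpy as np


def pool_region_features(region_maps, scores, conf_threshold=0.2):
    """Select confident regions and mean-pool each one to a single vector.

    region_maps: (N, H, W, D) conv features per region (after NMS, assumed).
    scores:      (N,) detection confidence per region.
    Returns V with shape (k, D), where k is the number of kept regions.
    """
    keep = scores > conf_threshold
    # Average over the two spatial axes, leaving one D-dim vector per region.
    return region_maps[keep].mean(axis=(1, 2))


rng = np.random.default_rng(0)
maps = rng.normal(size=(5, 7, 7, 2048))          # 5 candidate regions
scores = np.array([0.9, 0.1, 0.5, 0.05, 0.3])    # hypothetical confidences
V = pool_region_features(maps, scores)
print(V.shape)  # (3, 2048)
```

Three of the five hypothetical regions pass the threshold, so V holds three 2048-dimensional vectors v_i.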

2. Captioning model

The captioning model uses a 'soft' top-down attention mechanism that weights each input image feature, using the partially generated output sequence as context. This kind of approach is widely used, but the design choices below made it possible to reach state-of-the-art results.

The captioning model consists of two LSTM layers built from a standard LSTM implementation: a top-down attention LSTM and a language LSTM. At time step t, an LSTM layer takes the input x_t together with its own output from step t-1.
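
The recurrence described above is usually written in the compact form below, where x_t is the input vector and h_t the output vector of the LSTM; the memory cell is left out of the notation for brevity. This is a reconstruction from the description, since the original equation image is not reproduced here.

```latex
h_t = \mathrm{LSTM}(x_t, h_{t-1})
```

In the two-layer model, superscripts tell the layers apart: h^1_t is the output of the attention LSTM and h^2_t the output of the language LSTM.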

As the figure below shows, the first LSTM of the captioning model is the top-down attention LSTM and the second is the language LSTM.

2.1 Top-down attention LSTM

At step t, the first LSTM layer, the top-down attention LSTM, takes as input the concatenation of three vectors: h^2_t-1, the previous output of the language LSTM; v̄, the mean-pooled image feature; and an encoding of the previously generated word. These inputs are chosen to give the attention LSTM as much relevant context as possible.

W_e is the embedding matrix for the word vocabulary Σ.

Π_t is the one-hot encoding of the input word at time step t.
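
Putting the pieces above together, the input to the attention LSTM can be written as below. This is reconstructed from the definitions in this section; k denotes the number of region features.

```latex
x^1_t = \left[\, h^2_{t-1},\ \bar{v},\ W_e \Pi_t \,\right],
\qquad
\bar{v} = \frac{1}{k} \sum_{i=1}^{k} v_i
```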

1) attention

The attention weights α are computed from h^1_t, the output of the attention LSTM at step t. The W terms are learned parameters. At each step t, h^1_t yields an unnormalized attention weight a_i,t for each of the k image features v_i, and a softmax over these gives the normalized attention weights α_i,t.
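
The computation described above can be written as follows. The symbols W_va, W_ha, and w_a stand for the learned parameters; the equations are reconstructed from the description rather than copied from the original figure.

```latex
a_{i,t} = w_a^{\top} \tanh\!\left( W_{va}\, v_i + W_{ha}\, h^1_t \right),
\qquad
\alpha_t = \mathrm{softmax}(a_t)
```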

The attended image feature is then the sum of the k image features, each weighted by its attention weight.
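
A small NumPy sketch of this soft attention step, assuming the additive scoring form a_i,t = w_a^T tanh(W_va v_i + W_ha h^1_t) followed by a softmax and the weighted sum v̂_t = Σ_i α_i,t v_i. The function name and all dimensions are made up for illustration, and the parameters are random rather than trained.

```python
import numpy as np


def soft_attention(V, h1, W_va, W_ha, w_a):
    """Soft top-down attention over k region features.

    V:    (k, D) image features v_i
    h1:   (M,)   output of the attention LSTM at step t
    W_va: (H, D), W_ha: (H, M), w_a: (H,) learned parameters
    Returns (alpha, v_hat): weights of shape (k,) and attended feature (D,).
    """
    # Unnormalized scores a_i; the h1 term is shared by every region.
    a = np.tanh(V @ W_va.T + W_ha @ h1) @ w_a
    # Numerically stable softmax over the k regions.
    e = np.exp(a - a.max())
    alpha = e / e.sum()
    # Attended feature: convex combination of the region features.
    v_hat = alpha @ V
    return alpha, v_hat


rng = np.random.default_rng(0)
k, D, M, H = 36, 2048, 1000, 512   # illustrative sizes only
alpha, v_hat = soft_attention(
    rng.normal(size=(k, D)), rng.normal(size=M),
    rng.normal(size=(H, D)) * 0.01, rng.normal(size=(H, M)) * 0.01,
    rng.normal(size=H),
)
print(alpha.shape, v_hat.shape)
```

Because the weights sum to one, v̂_t always lies in the convex hull of the region features; with all-zero parameters the result reduces to plain mean pooling.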

2.2 Language LSTM

The input to the language LSTM is the output of the top-down attention LSTM together with the attended image feature.

v̂_t: attended image feature

h^1_t: output of the attention LSTM
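
In the same bracket notation used for concatenation elsewhere in this section, the language LSTM input can be written as below (a reconstruction from the two definitions above).

```latex
x^2_t = \left[\, \hat{v}_t,\ h^1_t \,\right]
```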

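The conditional probability of each word in the sequence is computed from the hidden state of the language LSTM.

y_1:T denotes the word sequence.

W_p and b_p are learned model parameters.

The probability of the whole sequence is the product of the conditional probabilities from step 1 to T.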
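
Written out, the two statements above take the following form, with h^2_t the output of the language LSTM at step t. The equations are reconstructed from the surrounding definitions.

```latex
p(y_t \mid y_{1:t-1}) = \mathrm{softmax}\!\left( W_p\, h^2_t + b_p \right)

p(y_{1:T}) = \prod_{t=1}^{T} p(y_t \mid y_{1:t-1})
```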

2.3 Objective

1) Cross entropy

y^*_1:T: target ground-truth sequence

θ: captioning model parameters

The parameters are trained to minimize this cross-entropy loss.
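
With the symbols defined above, the cross-entropy objective is the negative log-likelihood of the ground-truth caption under the model. The form below is reconstructed from those definitions.

```latex
L_{XE}(\theta) = - \sum_{t=1}^{T} \log p_{\theta}\!\left( y^{*}_{t} \mid y^{*}_{1:t-1} \right)
```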

2) Score function

For a fair comparison with recent work, results optimized for CIDEr through a score function were also examined.

The loss is the negative expected score under the score function.

r: score function (e.g. CIDEr)

Following Self-Critical Sequence Training (SCST), the gradient of this loss is approximated using the quantities below.

y^s_1:T: sampled caption

r(ŷ_1:T): baseline score obtained by greedily decoding the current model
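
Using the symbols above, the score-based loss and its SCST gradient approximation can be written as follows. This is a reconstruction from the definitions; the baseline term r(ŷ_1:T) only reduces variance and does not change the expected gradient.

```latex
L_{R}(\theta) = - \mathbb{E}_{y_{1:T} \sim p_{\theta}} \left[ r(y_{1:T}) \right]

\nabla_{\theta} L_{R}(\theta) \approx
  - \left( r(y^{s}_{1:T}) - r(\hat{y}_{1:T}) \right)
  \nabla_{\theta} \log p_{\theta}(y^{s}_{1:T})
```

A sampled caption that scores higher than the greedy one gets its log-probability pushed up; one that scores lower gets pushed down.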

Like other reinforcement learning methods, SCST explores the space of captions by sampling from the policy during training. The experiments follow SCST but restrict the sampling distribution to speed up training: beam search decoding is used, and captions are sampled only from the decoded beam. Empirically, the decoded beam usually contains at least one very high-scoring caption, although that caption often does not have the highest log-probability in the beam. By contrast, only a small number of unrestricted caption samples scored higher than the greedily decoded caption.

3. VQA model

The proposed VQA model weights each image feature through attention that uses the question representation as context. As the overall architecture shows, the model uses a joint multimodal embedding of the question and the image, followed by a prediction of regression scores over a set of candidate answers.

1) gated tanh

The non-linear transformations in the network use gated hyperbolic tangent activations. These are a special case of highway networks and showed a strong empirical advantage over ReLU or tanh layers. The 'gated tanh' layer is defined in terms of the following quantities.

σ: sigmoid activation function

W, W': learned weights

b, b': learned biases

◦ : Hadamard (element-wise) product
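
With these symbols, a gated tanh layer maps an input x to y = ỹ ◦ g, where ỹ = tanh(Wx + b) and g = σ(W'x + b'). The NumPy sketch below illustrates that definition; the function names and the toy dimensions are assumptions for illustration, and the parameters are random rather than trained.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_tanh(x, W, b, W_gate, b_gate):
    """Gated tanh layer: y = tanh(Wx + b) * sigmoid(W'x + b').

    W_gate and b_gate play the role of W' and b' in the text.
    """
    y_tilde = np.tanh(W @ x + b)          # candidate activation in (-1, 1)
    g = sigmoid(W_gate @ x + b_gate)      # gate in (0, 1)
    return y_tilde * g                    # Hadamard (element-wise) product


rng = np.random.default_rng(0)
x = rng.normal(size=8)
y = gated_tanh(x, rng.normal(size=(4, 8)), np.zeros(4),
               rng.normal(size=(4, 8)), np.zeros(4))
print(y.shape)  # (4,)
```

The gate scales each tanh unit independently, so every output stays strictly inside (-1, 1).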

The question is encoded with learned word embeddings and then passed through a gated recurrent unit (GRU). The resulting hidden state q is used to produce an unnormalized attention weight a_i for each image feature.

w^T_a: learned parameter vector
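
The unnormalized weight described above can be written as below, where f_a is a gated tanh layer applied to the concatenation of the region feature and the question representation. The equation is reconstructed from the description.

```latex
a_i = w_a^{\top} f_a\!\left( \left[\, v_i,\ q \,\right] \right)
```

The a_i are then normalized with a softmax and used to form the attended feature as a weighted sum of the v_i, in the same way as in the captioning model.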

The attended image feature v̂ and the GRU hidden state q are combined into a joint representation h of the question and the image, and the final output p(y), the distribution over candidate answers, is computed from h.
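
As I understand the paper, this last stage takes the form below: f_q, f_v, and f_o are gated tanh layers, W_o is a learned weight matrix, and σ is the sigmoid, so each candidate answer receives an independent score. Treat this as a reconstruction under that assumption.

```latex
h = f_q(q) \circ f_v(\hat{v})

p(y) = \sigma\!\left( W_o\, f_o(h) \right)
```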

Evaluation

1. Dataset

1) Visual Genome dataset

The Visual Genome dataset is used to pretrain the bottom-up attention model. It contains 108K images annotated with objects and the relationships between them, as well as 1.7M visual question answers. For pretraining, 5K images are held out for validation and the remaining 103K are used for training; of the annotations, only the object and attribute data are used.

Because the object and attribute annotations are free-form strings, they were first reduced to 2,000 object classes and 500 attribute classes. After removing ambiguous classes and classes with poor detection performance, the final training set uses 1,600 object classes and 400 attribute classes.

2) MS COCO dataset

The MS COCO 2014 captions dataset is used to evaluate the captioning model. It is divided with the 'Karpathy' split, which contains 113,287 training images with five captions each, plus 5K images each for validation and testing.

A vocabulary of 10,010 words was built through the following steps.

only minimal text-preprocessing

all sentences to lower case

tokenizing on white space

filtering out words that do not occur at least five times
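
The preprocessing steps listed above can be sketched in a few lines. This is an illustrative version, not the authors' script; the function name and the toy captions are made up.

```python
from collections import Counter


def build_vocab(captions, min_count=5):
    """Lowercase, split on whitespace, and keep words seen >= min_count times."""
    counts = Counter(
        token
        for caption in captions
        for token in caption.lower().split()
    )
    return sorted(word for word, n in counts.items() if n >= min_count)


captions = ["A dog runs"] * 5 + ["a cat sleeps"]
print(build_vocab(captions))  # ['a', 'dog', 'runs']
```

In the toy example, "cat" and "sleeps" appear only once and are dropped, while "A" and "a" are counted together after lowercasing.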

Caption quality is evaluated with SPICE [1], CIDEr [43], METEOR [8], ROUGE-L [22], and BLEU [29].

3) VQA v2.0 dataset

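The recently introduced VQA v2.0 dataset is used to evaluate the proposed VQA model. This dataset was the basis of the 2017 VQA Challenge and contains 1.1M questions and 11.1M answers relating to MS COCO images.

Standard question preprocessing and tokenization are applied, and for computational efficiency each question is trimmed to at most 14 words. The set of candidate answers consists of the answers that appear at least 8 times in the training set, which gives an output vocabulary of 3,129 answers.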
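
A rough sketch of these two preprocessing steps, assuming plain whitespace tokenization. The function names are hypothetical, and the occurrence threshold is left as a parameter because the exact boundary used by the authors is not verified here.

```python
from collections import Counter


def trim_question(question, max_words=14):
    """Tokenize on whitespace and keep at most max_words tokens."""
    return question.split()[:max_words]


def candidate_answers(train_answers, min_count=8):
    """Keep answers that occur at least min_count times in the training set."""
    counts = Counter(train_answers)
    return sorted(ans for ans, n in counts.items() if n >= min_count)


long_question = " ".join(f"w{i}" for i in range(20))
print(len(trim_question(long_question)))  # 14

answers = ["yes"] * 8 + ["no"] * 7 + ["2"] * 10
print(candidate_answers(answers))  # ['2', 'yes']
```

Restricting the output to frequent answers turns VQA into scoring a fixed set of candidates, which is what the regression scores in the model description are computed over.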

Answer quality is evaluated with the standard VQA metric.

2. ResNet Baseline

To quantify the effect of bottom-up attention in the image captioning and VQA experiments, the full model is compared with an ablated baseline. The baseline (ResNet) uses a ResNet pretrained on ImageNet in place of the bottom-up attention model.

3. Image captioning result

The full model and the ResNet baseline are compared with the previous state of the art, Self-critical Sequence Training (SCST).

The SPICE F-scores by subcategory improve with the Up-Down attention model.

Results of an ensemble of four models trained with CIDEr optimization, as officially reported by the MSCOCO evaluation server.

4. VQA result

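The full Up-Down VQA model is compared with the ResNet baselines. (NxN) denotes the spatial size of the ResNet features fed to the VQA model. The ResNet baseline uses roughly twice as many convolutional layers, yet it performs worse.

Results of a 30-model ensemble submitted to the official VQA 2.0 server. At the time of submission it outperformed all other entries, and it took first place in the 2017 VQA Challenge.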

5. Qualitative Analysis

To check how the proposed attention behaves, the regions attended by the Up-Down captioning model during caption generation are visualized.

Conclusions

The paper proposes a combined bottom-up and top-down visual attention mechanism. It allows attention to be computed more naturally at the level of objects and other salient regions, and it achieves state-of-the-art results on both image captioning and VQA.