Similarity Reasoning and Filtration for Image-Text Matching

Oct 29, 2021

Contents

Introduction
Proposed Method
Experiment
Quantitative Results and Analysis

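Image-text matching measures the visual-semantic similarity between an image and a text, and it is becoming increasingly important across vision-and-language tasks.

Although the field has advanced considerably in recent years, image-text matching remains a hard problem because of the complex matching patterns and the large semantic discrepancies between images and text.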

Introduction

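Previous approaches have three drawbacks.

They compute scalar cosine similarities between local features, which are not expressive enough to characterize the association patterns between regions and words.
Most of them aggregate all latent alignments between regions and words with simple max pooling or average pooling, which hinders information communication between local and global alignments.
They do not account for the distraction caused by less meaningful alignments.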

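To address these problems, the paper proposes the following approach.

It captures the global alignment between the whole image and the full sentence as well as the local alignments between image regions and sentence fragments.
Instead of characterizing these alignments with scalar cosine similarity, it learns vector-based similarity representations to model cross-modal associations more effectively.
It then introduces the Similarity Graph Reasoning (SGR) module, which relies on a graph convolutional neural network (GCNN) to capture the relationships between local and global alignments and infer a more accurate image-text similarity.
It also develops the Similarity Attention Filtration (SAF) module, which aggregates all alignments weighted by their importance scores, reducing the interference of less meaningful alignments and yielding more accurate cross-modal matching results.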

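The main contributions of the paper are summarized as follows:

It proposes learning vector-based similarity representations for image-text matching, which gives greater capacity to characterize the global alignment between images and sentences as well as the local alignments between regions and words.
It proposes the Similarity Graph Reasoning (SGR) module, which infers image-text similarity through graph reasoning. By capturing the relationships between local and global alignments, the module can identify more complex matching patterns and make more accurate predictions.
It proposes an effective Similarity Attention Filtration (SAF) module, which accounts for the interference of uninformative words during similarity aggregation and suppresses irrelevant interactions to further improve matching accuracy.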

Proposed Method

Generic Representation Extraction

Visual Representations

For each input image, visual features are extracted with a pretrained Faster R-CNN model.
A fully connected layer is added to map them to d-dimensional vectors, which are defined as the local region representations.
Self-attention is then applied over the local regions to obtain the global image representation.

Textual Representations

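The given sentence is split into words by tokenization, and the word embeddings are fed sequentially into a bi-directional GRU.
The representation of each word is then obtained by averaging the forward and backward hidden states at each time step.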
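
The averaging step above can be sketched in a few lines. This is a minimal illustration that assumes the forward and backward hidden states are already available as plain lists of floats; the function name `average_bigru_states` and the sample values are hypothetical and not taken from the paper.

```python
def average_bigru_states(forward_states, backward_states):
    """Average forward and backward hidden states at each time step.

    Both inputs hold one vector per word, and all vectors share the
    same dimension. The result has the same shape as either input.
    """
    if len(forward_states) != len(backward_states):
        raise ValueError("forward and backward sequences must have equal length")
    return [
        [(f + b) / 2.0 for f, b in zip(fwd, bwd)]
        for fwd, bwd in zip(forward_states, backward_states)
    ]


# Hypothetical hidden states for a two-word sentence with d = 3.
forward = [[1.0, 2.0, 3.0], [0.0, 4.0, 2.0]]
backward = [[3.0, 0.0, 1.0], [2.0, 0.0, 2.0]]
word_features = average_bigru_states(forward, backward)
```

Averaging, rather than concatenating, keeps each word feature at the same dimension d as the region features, which is what lets the two modalities be compared directly later on.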

Likewise, the global text representation is computed by self-attention over all word features.
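
The self-attention pooling used for both the global image and global text representations can be sketched as follows. The post does not give the exact parameterization, so this sketch assumes the simplest variant: the mean of the local features serves as the query, scores are plain dot products, and no learned projections are involved. The name `attention_pool` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features):
    """Pool local features (regions or words) into one global vector.

    Assumption: the query is the mean of the local features and the
    scores are dot products; the paper's exact form may differ.
    """
    dim = len(features[0])
    count = len(features)
    query = [sum(f[k] for f in features) / count for k in range(dim)]
    scores = [sum(q * x for q, x in zip(query, f)) for f in features]
    weights = softmax(scores)
    pooled = [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]
    return pooled, weights


# Three hypothetical 2-dimensional local features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
global_feature, weights = attention_pool(regions)
```

In this example the third feature is the one most aligned with the mean query, so it receives the largest weight in the pooled vector.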

Similarity Representation Learning

Vector Similarity Function

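Unlike existing methods, the paper computes vector-based similarity representations instead of scalar values, capturing more detailed associations between feature representations from different modalities.
The similarity function, Eq. (1), maps a pair of feature vectors to an m-dimensional similarity vector through a learnable parameter matrix.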
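
A vector similarity function of this kind can be sketched as below. The equation itself is not reproduced in this post, so the form used here is an assumption: the element-wise squared difference of the two vectors is projected by an m x d matrix and the result is L2-normalized. The name `similarity_vector` and the matrix values are hypothetical.

```python
import math


def similarity_vector(x, y, weight):
    """Return an m-dimensional, L2-normalized similarity vector.

    weight is an m x d matrix given as a list of m rows; x and y are
    d-dimensional. Assumed form: weight @ (x - y)**2 divided by its
    L2 norm.
    """
    squared_diff = [(a - b) ** 2 for a, b in zip(x, y)]
    projected = [sum(w * s for w, s in zip(row, squared_diff)) for row in weight]
    norm = math.sqrt(sum(p * p for p in projected))
    if norm == 0.0:
        return projected  # identical inputs or a zero projection
    return [p / norm for p in projected]


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # hypothetical 2 x 3 matrix (m = 2, d = 3)
image_vec = [1.0, 2.0, 3.0]
text_vec = [1.0, 0.0, 3.0]
sim = similarity_vector(image_vec, text_vec, W)
```

Unlike a cosine score, the output keeps one value per learned direction, so later modules can weigh different aspects of the match separately.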

Global Similarity Representation

The global similarity representation is computed by applying Eq. (1) to the global image feature and the global sentence feature, using a learnable parameter matrix dedicated to the global similarity representation.

Local Similarity Representation

To exploit local similarity representations between the local features of the visual and textual observations, text-to-visual attention is applied.
For each word, an attention weight is computed for every region with a softmax over the cosine similarities between the i-th region feature and the j-th word feature.

The attended visual feature for each word is the attention-weighted sum of the region features.

The local similarity representation is then computed from each attended visual feature and its word feature, using a learnable parameter matrix dedicated to the local similarity representation.
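
The text-to-visual attention and the resulting local similarity representations can be sketched end to end. This is an illustration under stated assumptions: the softmax is taken over regions for each word, the `temperature` scaling factor is a hypothetical parameter, and the similarity function uses the normalized squared-difference form, which is assumed rather than quoted from the post.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors, 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_regions(regions, word, temperature=1.0):
    """Weighted sum of region features, with weights from a softmax over regions."""
    weights = softmax([temperature * cosine(r, word) for r in regions])
    return [sum(w * r[k] for w, r in zip(weights, regions)) for k in range(len(word))]


def similarity_vector(x, y, weight):
    """Assumed form: L2-normalized projection of the squared difference."""
    squared_diff = [(a - b) ** 2 for a, b in zip(x, y)]
    projected = [sum(w * s for w, s in zip(row, squared_diff)) for row in weight]
    norm = math.sqrt(sum(p * p for p in projected))
    return [p / norm for p in projected] if norm else projected


def local_similarities(regions, words, weight):
    """One similarity vector per word, against its attended visual feature."""
    return [similarity_vector(attend_regions(regions, w), w, weight) for w in words]


regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_local = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 3 x 2 matrix
local_sims = local_similarities(regions, words, W_local)
```

The result contains one similarity vector per word, which is why the later graph has as many local nodes as the sentence has words.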

Similarity Graph Reasoning

Graph Building

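All word-attended similarity representations and the global similarity representation are taken as graph nodes.

The edge from one node to another is computed from the two node representations, after applying two linear transformations: one for the incoming node and one for the outgoing node.

Because the edge between two nodes is directed, it allows efficient and complex information propagation for similarity reasoning.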
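
Directed edges of this kind can be sketched as follows. The exact formula is not reproduced in this post, so this sketch assumes that each edge weight is a softmax, over all candidate neighbours, of the dot product between the two transformed nodes. The names `build_edges`, `w_in`, and `w_out` and the matrix values are hypothetical.

```python
import math


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def build_edges(nodes, w_in, w_out):
    """Return edges[p][q], the edge weight between node p and node q.

    Each row is normalized over q. Because w_in and w_out differ,
    edges[p][q] is generally not equal to edges[q][p].
    """
    incoming = [matvec(w_in, s) for s in nodes]
    outgoing = [matvec(w_out, s) for s in nodes]
    edges = []
    for in_vec in incoming:
        scores = [sum(a * b for a, b in zip(in_vec, out_vec)) for out_vec in outgoing]
        edges.append(softmax(scores))
    return edges


nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_in = [[1.0, 0.0], [0.0, 1.0]]
w_out = [[0.0, 1.0], [1.0, 0.0]]
edges = build_edges(nodes, w_in, w_out)
```

Using two separate transformations is what makes the graph directed: swapping the roles of two nodes changes which transformation each one passes through.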

Graph Reasoning

With the constructed graph nodes and edges, similarity graph reasoning is performed by updating the nodes and edges.

The similarity is reasoned iteratively for N steps. The output of the global node at the last step is taken as the reasoned similarity representation and fed into a fully connected layer to obtain the final similarity score.
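
The iterative reasoning can be sketched as a loop over update steps. The update rule is assumed here, since the equation is not reproduced in this post: each node aggregates its neighbours weighted by the edges, then passes through a linear map and a ReLU. Placing the global node at index 0 and the names `reasoning_step` and `reason` are hypothetical choices, and the final fully connected layer is omitted.

```python
import math


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reasoning_step(nodes, w_in, w_out, w_r):
    """One update: recompute edges, aggregate neighbours, apply linear + ReLU."""
    incoming = [matvec(w_in, s) for s in nodes]
    outgoing = [matvec(w_out, s) for s in nodes]
    dim = len(nodes[0])
    updated = []
    for in_vec in incoming:
        scores = [sum(a * b for a, b in zip(in_vec, out_vec)) for out_vec in outgoing]
        edge_row = softmax(scores)
        aggregated = [sum(e * s[k] for e, s in zip(edge_row, nodes)) for k in range(dim)]
        updated.append([max(0.0, h) for h in matvec(w_r, aggregated)])
    return updated


def reason(nodes, step_params):
    """Run one step per parameter triple and return the global node (index 0)."""
    for w_in, w_out, w_r in step_params:
        nodes = reasoning_step(nodes, w_in, w_out, w_r)
    return nodes[0]


identity = [[1.0, 0.0], [0.0, 1.0]]
nodes = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]  # global node first, then two local nodes
reasoned = reason(nodes, [(identity, identity, identity)] * 3)  # N = 3 steps
```

Only the global node is read out at the end, so the local nodes matter through what they propagate into it over the N steps.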

Similarity Attention Filtration

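Exploiting local alignments enables finer-grained correspondence between image regions and sentence fragments and can improve matching performance, but less-meaningful alignments hinder the model's discriminative ability.

The Similarity Attention Filtration (SAF) module suppresses ineffective alignments and emphasizes the important ones.

Given the local and global similarity representations, an aggregation weight is computed for each similarity representation.

The similarity representations are aggregated with these weights and fed into a fully connected layer to predict the final similarity between the image and the sentence.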
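
The filtration and aggregation can be sketched as follows. The weight formula is not reproduced in this post, so this sketch assumes a sigmoid score per similarity representation, normalized so the weights sum to one; the original may include additional normalization layers. The names `filter_and_aggregate` and `w_f` are hypothetical, and the final fully connected layer is omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def filter_and_aggregate(similarities, w_f):
    """Weight each similarity vector by a normalized sigmoid score, then sum.

    w_f is a vector that maps a similarity representation to a scalar
    importance logit (an assumed parameterization).
    """
    gates = [sigmoid(sum(w * s for w, s in zip(w_f, sim))) for sim in similarities]
    total = sum(gates)
    weights = [g / total for g in gates]
    dim = len(similarities[0])
    aggregated = [
        sum(b * sim[k] for b, sim in zip(weights, similarities)) for k in range(dim)
    ]
    return aggregated, weights


sims = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical similarity representations
w_f = [2.0, -2.0]
aggregated, weights = filter_and_aggregate(sims, w_f)
```

Here the first representation gets a high importance logit and dominates the aggregate, while the second is kept but heavily down-weighted rather than dropped outright.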

Training Objectives and Inference Strategies

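Both the SGR and SAF modules are trained with a bidirectional ranking loss.

For a matched image-text pair (v, t), the bidirectional ranking loss is computed against the hardest negative image and the hardest negative text within the minibatch.

The loss has a margin parameter and uses the similarity prediction function of the SGR module.

The SAF module is trained with the same loss, using its own similarity prediction function.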
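
A bidirectional ranking loss with hardest in-batch negatives can be sketched as below. It assumes a square score matrix where `scores[i][j]` is the predicted similarity between image i and text j and the diagonal holds the matched pairs. The function name, the summation over the batch, and the margin value of 0.2 are assumptions for illustration.

```python
def hardest_negative_ranking_loss(scores, margin=0.2):
    """Sum of hinge losses against the hardest negative text and image.

    scores[i][j] is the similarity of image i and text j; pair (i, i)
    is the matched pair. Requires a batch of at least two pairs.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two pairs to form negatives")
    total = 0.0
    for i in range(n):
        positive = scores[i][i]
        # Hardest negative text for image i: highest-scoring wrong text.
        hardest_text = max(scores[i][j] for j in range(n) if j != i)
        # Hardest negative image for text i: highest-scoring wrong image.
        hardest_image = max(scores[k][i] for k in range(n) if k != i)
        total += max(0.0, margin - positive + hardest_text)
        total += max(0.0, margin - positive + hardest_image)
    return total


scores = [
    [0.9, 0.3, 0.1],
    [0.2, 0.8, 0.75],
    [0.4, 0.1, 0.7],
]
loss = hardest_negative_ranking_loss(scores, margin=0.2)
```

In this batch only two terms violate the margin: text 2 is too close to image 1 (0.75 against 0.8), and image 1 is too close to text 2 (0.75 against 0.7).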

Both joint training and independent training of SGR and SAF are tested.

In joint training, the two modules share the same similarity representations and their losses are computed simultaneously.

At inference, the similarities predicted by the SGR and SAF modules are averaged.

Experiment

Quantitative Results and Analysis

Ablation Studies

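The SGR module propagates information between local and global alignments, so it tends to capture a few important cues and discard relatively unimportant interactions.

The SAF module, in contrast, gathers all meaningful alignments and removes only the completely irrelevant interactions.

The SGR and SAF modules therefore appear to be incompatible with each other, and training them independently gives a clear improvement.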

Qualitative Results and Analysis
