DOLG: Single-Stage Image Retrieval with Deep Orthogonal Fusion of Local and Global Features

Dec 02, 2021

Introduction

The paper proposes single-stage image retrieval built on a new orthogonal global and local feature fusion framework that produces a compact, representative image descriptor and can be trained end-to-end.

To extract discriminative local features, the local branch uses multi-atrous convolution layers and a self-attention module, both designed to improve its performance.

The single-stage method introduced in the paper significantly outperforms previous two-stage state-of-the-art methods.

Proposed Method

Overview

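The global branch is kept identical to the original ResNet, except that global average pooling is replaced with GeM pooling and an FC layer is used to reduce the feature dimension.

GeM pooling over the Res4 output feature map is defined as

$$f_{g,c} = \left(\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} x_{c,i,j}^{\,p}\right)^{1/p}$$

where x is the C x H x W output of Res4, c indexes the channels, and p is the pooling exponent.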
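
As a rough illustration of GeM pooling, the sketch below pools a (C, H, W) feature map with NumPy. The function name `gem_pool`, the `eps` clamp, and the toy input are assumptions made for this example and are not taken from the paper's code.

```python
import numpy as np

def gem_pool(feature_map: np.ndarray, p: float = 3.0, eps: float = 1e-6) -> np.ndarray:
    """Generalized-mean pooling over the spatial axes of a (C, H, W) map."""
    # Clamp to a small positive value so the fractional power is well defined
    # (eps is an assumption of this sketch).
    clamped = np.clip(feature_map, eps, None)
    return np.mean(clamped ** p, axis=(1, 2)) ** (1.0 / p)

fmap = np.array([[[1.0, 2.0], [3.0, 4.0]]])  # one channel, 2 x 2
print(gem_pool(fmap, p=1.0))  # p = 1 reduces to average pooling: [2.5]
print(gem_pool(fmap, p=3.0))  # roughly 2.92, between the mean (2.5) and the max (4.0)
```

With p = 1 the result equals average pooling, and larger p moves the result toward max pooling.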

In parallel, to extract local descriptors, the local branch is attached after the Res3 block of ResNet.

The local branch consists of multiple atrous convolution layers and a self-attention module.

A novel orthogonal fusion module then aggregates the global and local features into the final compact descriptor.

Local Branch

Multi-atrous convolution layer & self-attention

Builds a feature pyramid that can handle scale variation across image instances.
Contains three dilated convolution layers with different spatial receptive fields for obtaining feature maps, plus a global average pooling branch.
The resulting features are concatenated and processed by a 1 x 1 convolution layer.
In the self-attention module, the local feature first goes through a 1 x 1 conv-bn; ReLU is then applied, followed by another 1 x 1 conv layer and a SoftPlus operation.
The resulting attention map is multiplied element-wise with the L2-normalized local feature.
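
The attention step described above can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: batch normalization is omitted, each 1 x 1 convolution is written as a per-pixel linear map, and the L2 normalization is assumed to act on the output of the first 1 x 1 conv. The names `local_attention`, `w1`, and `w2` are hypothetical.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def local_attention(local_feat, w1, w2):
    """local_feat: (C, H, W); w1: (C, C) and w2: (1, C) stand in for 1 x 1 convs."""
    # A 1 x 1 convolution is a linear map over channels at every pixel.
    feat = np.einsum("oc,chw->ohw", w1, local_feat)        # first 1 x 1 conv (BN omitted)
    hidden = np.maximum(feat, 0.0)                          # ReLU
    attn = softplus(np.einsum("oc,chw->ohw", w2, hidden))   # (1, H, W) attention map
    norm = np.linalg.norm(feat, axis=0, keepdims=True)
    feat_normed = feat / np.maximum(norm, 1e-12)            # L2-normalize each location
    return attn * feat_normed, attn

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
x = rng.standard_normal((C, H, W))
out, attn = local_attention(x, rng.standard_normal((C, C)), rng.standard_normal((1, C)))
print(out.shape, attn.shape)  # (8, 4, 4) (1, 4, 4)
```

Because the feature at each location is unit-normalized before weighting, the norm of each output vector equals its attention score.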

Orthogonal Fusion Module

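The projection of each local feature point onto the global feature is computed as

$$f_{l,proj}^{(i,j)} = \frac{f_l^{(i,j)} \cdot f_g}{\lVert f_g \rVert_2^2}\, f_g$$

where $f_l^{(i,j)} \cdot f_g$ is the dot product and $\lVert f_g \rVert_2^2$ is the squared L2 norm of $f_g$.

The orthogonal component is the difference between the local feature and its projection vector:

$$f_{l,orth}^{(i,j)} = f_l^{(i,j)} - f_{l,proj}^{(i,j)}$$

In this way, the points orthogonal to the global feature can be output as a C x H x W tensor.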
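
A minimal NumPy sketch of this decomposition, assuming a (C, H, W) local feature map and a C-dimensional global feature; the function name `orthogonal_component` and the random inputs are illustrative only.

```python
import numpy as np

def orthogonal_component(f_local, f_global):
    """Return the part of each local feature point that is orthogonal to f_global.

    f_local: (C, H, W) local feature map; f_global: (C,) global feature.
    """
    # Dot product between every local feature point and the global feature.
    dots = np.einsum("chw,c->hw", f_local, f_global)
    # Projection of each point onto the global feature direction.
    proj = dots[None, :, :] * f_global[:, None, None] / np.dot(f_global, f_global)
    return f_local - proj

rng = np.random.default_rng(0)
f_l = rng.standard_normal((16, 5, 5))
f_g = rng.standard_normal(16)
f_orth = orthogonal_component(f_l, f_g)
# Every point of f_orth should now have (numerically) zero dot product with f_g.
print(np.abs(np.einsum("chw,c->hw", f_orth, f_g)).max())
```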

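With orthogonal fusion, information related to the global feature is therefore removed from each local feature point.
The orthogonal component supplies complementary information that describes the image better and, being uncorrelated with the global feature, does not over-emphasize it.

The tensor is then concatenated with the global feature and aggregated into a new tensor.

The paper describes this step only as a pooling function that aggregates the concatenated tensor, but notes that the design is still preliminary and that it could in practice be a learnable module.

Finally, an FC layer produces a 512 x 1 descriptor.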
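
The concatenate, aggregate, and FC steps can be sketched as below. This assumes the global feature is tiled across all H x W positions before concatenation and that the aggregation is plain average pooling; both are assumptions of this example, as are the names `fuse`, `fc_weight`, and `fc_bias`.

```python
import numpy as np

def fuse(f_orth, f_global, fc_weight, fc_bias):
    """f_orth: (C, H, W); f_global: (C,); fc_weight: (D, 2C); fc_bias: (D,)."""
    C, H, W = f_orth.shape
    # Tile the global feature over all spatial positions (assumed layout).
    g_map = np.broadcast_to(f_global[:, None, None], (C, H, W))
    fused = np.concatenate([g_map, f_orth], axis=0)   # (2C, H, W)
    pooled = fused.mean(axis=(1, 2))                  # average pooling (assumed aggregator)
    return fc_weight @ pooled + fc_bias               # (D,) final descriptor

rng = np.random.default_rng(0)
C, H, W = 4, 3, 3
f_orth = rng.standard_normal((C, H, W))
f_g = rng.standard_normal(C)
desc = fuse(f_orth, f_g, rng.standard_normal((512, 2 * C)), np.zeros(512))
print(desc.shape)  # (512,)
```

Swapping the `mean` call for a learnable layer would correspond to the learnable aggregation the paper mentions as a possibility.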

Training Objective

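As in DELG, a single L2-normalized N-class prediction head and image-level labels are used for training.

The whole network is trained with the ArcFace margin loss:

$$L = -\log\left(\frac{\exp\left(\gamma \cdot AF(\omega_t^{T} f, 1)\right)}{\sum_{n} \exp\left(\gamma \cdot AF(\omega_n^{T} f, y_n)\right)}\right)$$

Here f is the L2-normalized descriptor, $\omega_n$ is the L2-normalized classifier weight of class n, y is the one-hot label vector, t is the index of the ground-truth class, and $\gamma$ is the scale.

AF denotes the ArcFace-adjusted cosine similarity, which is calculated as

$$AF(s, c) = (1 - c) \cdot s + c \cdot \cos(\arccos(s) + m)$$

where s is the cosine similarity, m is the margin, and c = 1 only for the ground-truth class.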
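
A small NumPy sketch of an ArcFace-style loss for a single sample: the angular margin is added only to the ground-truth class, and the scaled similarities are passed through softmax cross-entropy. The function names, the two-class toy weights, and the default values (margin 0.15, scale 30, as listed under the implementation details) are choices made for this illustration.

```python
import numpy as np

def arcface_adjust(s, c, m=0.15):
    """AF(s, c): add the angular margin m only where c == 1 (ground-truth class)."""
    s = np.clip(s, -1.0, 1.0)  # keep arccos in its valid domain
    return (1.0 - c) * s + c * np.cos(np.arccos(s) + m)

def arcface_loss(descriptor, class_weights, label, m=0.15, scale=30.0):
    """descriptor: (D,); class_weights: (N, D); label: ground-truth class index."""
    f = descriptor / np.linalg.norm(descriptor)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = w @ f                                   # cosine similarity to each class
    one_hot = np.zeros_like(cos)
    one_hot[label] = 1.0
    logits = scale * arcface_adjust(cos, one_hot, m)
    logits = logits - logits.max()                # stabilize the softmax
    return -(logits[label] - np.log(np.exp(logits).sum()))

desc = np.array([1.0, 0.0])
weights = np.array([[1.0, 0.2], [0.0, 1.0]])
print(arcface_loss(desc, weights, label=0, m=0.0, scale=1.0))
print(arcface_loss(desc, weights, label=0, m=0.15, scale=1.0))  # larger: the margin makes the task harder
```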

Experiment

Implementation Details

Datasets and Evaluation metric

Google landmarks dataset V2 (GLDv2)

5M images of 200K different instance tags

GLDv2-clean

1,580,470 images of 81,313 classes
The model is trained on this data.

Oxford (Roxf) and Paris (Rpar) datasets

4,993 and 6,322 images, respectively.
70 query images each.
Model evaluation is performed on these datasets.

Mean average precision (mAP) is used as the evaluation metric.
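
For readers unfamiliar with the metric, here is a textbook-style mAP computation in plain Python. It is only a sketch: it assumes each query comes with a ranked list of 0/1 relevance flags containing all relevant items, and it is not the official Roxf/Rpar evaluation script, which additionally handles details such as ignored ("junk") images.

```python
def average_precision(ranked_relevance):
    """ranked_relevance: 0/1 flags for one query's results, best rank first."""
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / hits if hits else 0.0

def mean_average_precision(per_query_relevance):
    """Average the per-query AP values over all queries."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps)

queries = [[1, 0, 1], [0, 1]]
print(mean_average_precision(queries))  # AP values are 5/6 and 1/2, so mAP is 2/3
```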

Implementation details

80% of the training data is used for training and 20% for validation.
ResNet50 and ResNet101 backbones are mainly evaluated.
Initialized with ImageNet pre-trained weights.
Random cropping and aspect-ratio distortion are applied as augmentation, followed by resizing to 512 x 512.
Trained for 100 epochs with a batch size of 128.

Eight V100 GPUs with 16GB of memory each are used.

Training took 3.8 days for ResNet50 and 6.3 days for ResNet101.
SGD optimizer with momentum of 0.9.
Weight decay factor is 1e-4.
lr = 0.05, with 5 warm-up epochs.
The ArcFace margin loss uses margin m = 0.15 and scale = 30.
For GeM pooling, p is fixed to 3.0.
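
The hyperparameters above can be gathered into a small config object, sketched here in Python. The class and field names are made up for this example. The notes only state that 5 warm-up epochs are used, so the linear warm-up shape and the constant rate afterwards in `warmup_lr` are assumptions, not the paper's schedule.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    # Values taken from the implementation details listed above.
    image_size: int = 512
    batch_size: int = 128
    epochs: int = 100
    momentum: float = 0.9
    weight_decay: float = 1e-4
    base_lr: float = 0.05
    warmup_epochs: int = 5
    arcface_margin: float = 0.15
    arcface_scale: float = 30.0
    gem_p: float = 3.0

def warmup_lr(cfg: TrainConfig, epoch: int) -> float:
    """Linear warm-up to base_lr (assumed shape), constant afterwards in this sketch."""
    if epoch < cfg.warmup_epochs:
        return cfg.base_lr * (epoch + 1) / cfg.warmup_epochs
    return cfg.base_lr

cfg = TrainConfig()
print([round(warmup_lr(cfg, e), 3) for e in range(7)])
```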

Results

Comparison with local feature based solutions

R50-How clearly outperforms DELF, yet DOLG performs better still while using the same ResNet50 backbone.

Comparison with global feature based solutions

R50-DOLG outperforms even R101-DELG, which uses global features.

Comparison with global+local feature based solutions

DOLG outperforms DELG, which also makes use of local features.

The paper does not comment on R101-DELG performing better on Roxf-Hard.

“+1M” distractors

When a huge number of distractors is present, less robust global and local features lead to more severe error accumulation.

Qualitative Analysis

Ablation Study

To verify some of the design choices empirically, ablation experiments are run with the ResNet50 backbone.

Where to Fuse

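This experiment checks which block is better suited for the orthogonal integration of global and local features.
Shallow layers are known to be unsuitable for local feature representations, so mainly the res3 and res4 blocks are examined.
f3 has both sufficient spatial resolution and sufficient network depth, which makes it a better local feature than f4.
Using both f3 and f4 makes the model more complex.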

Impact of Poolings

Shows how GeM pooling [34] and global average pooling affect the overall framework.

Impact of Each Component in the Local Branch

Multi-Atrous slightly lowers performance, but this is not a concern because mAP is already very high and the drop only affects easy cases.

Verification of the Orthogonal Fusion

To show that orthogonal fusion is the better choice, the orthogonal decomposition procedure shown in Figure 4a is removed and fl and fg are concatenated directly.
A Hadamard product (also known as element-wise product) is tried as well.
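
The two baselines can be illustrated with NumPy as follows. How exactly the paper tiles, pools, and projects the features in these variants is not stated in these notes, so the broadcasting of the global feature and the average pooling here are assumptions, and the function names are hypothetical.

```python
import numpy as np

def concat_fusion(f_local, f_global):
    """Concatenate fg (tiled over H x W) with fl directly, without the orthogonal step."""
    C, H, W = f_local.shape
    g_map = np.broadcast_to(f_global[:, None, None], (C, H, W))
    return np.concatenate([g_map, f_local], axis=0).mean(axis=(1, 2))  # (2C,)

def hadamard_fusion(f_local, f_global):
    """Element-wise (Hadamard) product of fl with fg at every spatial position."""
    return (f_local * f_global[:, None, None]).mean(axis=(1, 2))       # (C,)

rng = np.random.default_rng(0)
f_l = rng.standard_normal((16, 5, 5))
f_g = rng.standard_normal(16)
print(concat_fusion(f_l, f_g).shape, hadamard_fusion(f_l, f_g).shape)  # (32,) (16,)
```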
