SuperGlue: Learning Feature Matching with Graph Neural Networks

Oct 29, 2021

Contents

Introduction
Proposed Method
Experiment
Conclusions

Introduction

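This paper proposes a neural network for matching local features. Whereas earlier approaches learned task-agnostic local features and paired them with simple matching algorithms and heuristics, SuperGlue learns the matching itself from existing local features. From a SLAM perspective, it can be viewed as a learnable middle-end that sits between the front-end, which extracts visual features, and the back-end, which performs bundle adjustment and pose estimation.

SuperGlue finds correspondences reliably even in very challenging cases such as the one below.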

Proposed Method

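Consider two images A and B. Each has a set of keypoint positions p and associated visual descriptors d. Each keypoint position is augmented with its detection confidence c, giving

p_i := (x,y,c)_i

The visual descriptors

\rm d_i \in \R^D

can be extracted by a CNN such as SuperPoint, or they can be traditional descriptors such as SIFT. Images A and B are assumed to have M and N local features, respectively.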

Not every keypoint has a match in the other image, so the correspondence between the two sets is a partial assignment. The goal is to compute this partial assignment between the two sets of local features. For use in downstream tasks, each match is expressed with a confidence value.

The SuperGlue architecture consists of two components, as shown below.

(1) Attentional Graph Neural Network: combines the keypoint position and the visual descriptor into a single vector, then repeatedly applies self- and cross-attention layers to build a more powerful feature representation.

(2) Optimal Matching Layer: builds an M×N score matrix and finds the optimal partial assignment with the Sinkhorn algorithm.

Attentional Graph Neural Network

Keypoint Encoder

The initial representation

^{(0)} \rm x_{\it{i}}

of each keypoint is defined as the sum of its visual descriptor and its encoded position.
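The encoder can be sketched in a few lines of NumPy. The paper encodes the position with an MLP; the helper names (`mlp`, `encode_keypoints`), the layer widths, and the random weights below are placeholders chosen for illustration, not values from the paper.

```python
import numpy as np

def mlp(x, weights, biases):
    """Apply a small ReLU MLP row-wise; the last layer is linear."""
    for k, (W, b) in enumerate(zip(weights, biases)):
        x = x @ W + b
        if k < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

def encode_keypoints(p, d, weights, biases):
    """Initial representation: descriptor plus MLP-encoded (x, y, c)."""
    return d + mlp(p, weights, biases)

rng = np.random.default_rng(0)
M, D = 5, 256                      # 5 keypoints, 256-dim descriptors
p = rng.random((M, 3))             # rows are (x, y, c)
d = rng.standard_normal((M, D))    # visual descriptors
sizes = [3, 32, D]                 # hypothetical layer widths
weights = [rng.standard_normal((i, o)) * 0.1
           for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]

x0 = encode_keypoints(p, d, weights, biases)
print(x0.shape)  # (5, 256)
```

Because the position is lifted to the descriptor dimension before the sum, later attention layers can reason about appearance and location jointly.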

Multiplex Graph Neural Network

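Consider a single graph whose nodes are all the keypoints of both images. The graph has two types of edges:

\varepsilon_{self}

: Intra-image edges, which connect each keypoint to all other keypoints in the same image

\varepsilon_{cross}

: Inter-image edges, which connect each keypoint to all keypoints in the other image

The multiplex graph neural network uses message passing to propagate information along both types of edges. At each layer, every node aggregates the messages arriving over its edges and uses them to update its representation.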

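Let

^{(l)} {\rm x}_{i}^{A}

be the representation of node i of image A at layer l. The message

m_{\varepsilon\rightarrow i}

is the result of aggregating the messages from all keypoints connected to node i by the edges in \varepsilon, where \varepsilon is either \varepsilon_{self} or \varepsilon_{cross}. Every node is then updated with a residual message passing step, which adds to the current representation an update computed from that representation and the message.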
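In the notation above, the residual update from the SuperGlue paper can be written as follows, where [· ‖ ·] denotes concatenation:

```latex
^{(l+1)} {\rm x}_{i}^{A} = {}^{(l)} {\rm x}_{i}^{A}
  + {\rm MLP}\left(\left[\, ^{(l)} {\rm x}_{i}^{A} \,\|\, m_{\varepsilon\rightarrow i} \,\right]\right)
```

The same update is applied simultaneously to the keypoints of image B. In the paper the edge set alternates by layer, starting with l = 1: \varepsilon_{self} for odd l and \varepsilon_{cross} for even l.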

The message is computed with an attention mechanism: self-attention along \varepsilon_{self} and cross-attention along \varepsilon_{cross}.
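A minimal single-head sketch of this aggregation is shown below. It follows the paper's form, where the weight \alpha_{ij} is a softmax over j of the query-key similarity and the message is the \alpha-weighted sum of values. The function name and the projection matrices `Wq`, `Wk`, `Wv` are hypothetical, biases are omitted, and the multi-head split is left out for brevity.

```python
import numpy as np

def attention_message(x_query, x_source, Wq, Wk, Wv):
    """Single-head attentional aggregation.

    Self-attention: x_source is the same image as x_query.
    Cross-attention: x_source comes from the other image.
    """
    q = x_query @ Wq                  # (M, D) queries
    k = x_source @ Wk                 # (N, D) keys
    v = x_source @ Wv                 # (N, D) values
    logits = q @ k.T                  # (M, N) similarities q_i . k_j
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    alpha = np.exp(logits)
    alpha /= alpha.sum(axis=1, keepdims=True)     # softmax over j
    return alpha @ v, alpha           # messages (M, D), weights (M, N)

rng = np.random.default_rng(0)
D = 8
xA = rng.standard_normal((4, D))   # 4 keypoints in image A
xB = rng.standard_normal((3, D))   # 3 keypoints in image B
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))

m_self, a_self = attention_message(xA, xA, Wq, Wk, Wv)    # along self edges
m_cross, a_cross = attention_message(xA, xB, Wq, Wk, Wv)  # along cross edges
```

The only difference between the two attention types is where the keys and values come from, which is why one function covers both.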

The figure below visualizes the attention weights

\alpha_{ij}

to show how self-attention and cross-attention behave.

The final matching descriptors are linear projections of the last-layer representations.

Optimal Matching Layer

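The optimal matching layer computes the partial assignment matrix. In the standard graph matching formulation, the assignment

\rm P

is obtained by computing a score matrix

{\rm S} \in \R^{M \times N}

over all possible matches and maximizing the total score

\sum_{i,j}{\rm S}_{i,j}{\rm P}_{i,j}

under the partial assignment constraints.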

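The pairwise score is the inner product between the matching descriptors of the two images.

So that the network can discard keypoints that have no match, each keypoint set is augmented with a dustbin, and unmatched keypoints are explicitly assigned to a dustbin.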
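The two steps above, scoring and dustbin augmentation, can be sketched as follows. In the paper the new row and column are filled with a single learnable scalar; here `z` is just a float argument, and the function name `augmented_scores` is a placeholder.

```python
import numpy as np

def augmented_scores(fA, fB, z):
    """Build the (M+1) x (N+1) score matrix with a dustbin row and column."""
    S = fA @ fB.T                              # S[i, j] = <fA_i, fB_j>
    M, N = S.shape
    S_bar = np.full((M + 1, N + 1), z, dtype=S.dtype)
    S_bar[:M, :N] = S                          # keep the pairwise scores
    return S_bar

rng = np.random.default_rng(0)
fA = rng.standard_normal((3, 4))   # M = 3 matching descriptors
fB = rng.standard_normal((2, 4))   # N = 2 matching descriptors
S_bar = augmented_scores(fA, fB, z=1.0)
print(S_bar.shape)  # (4, 3)
```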

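Each keypoint is assigned either to a single keypoint in the other image or to the dustbin, whereas a dustbin can absorb as many keypoints as the other image contains. The expected number of matches is therefore 1 for every keypoint, N for the dustbin of A, and M for the dustbin of B.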
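In the paper's notation, these expected match counts are collected in two vectors, where 1_M denotes a vector of M ones:

```latex
{\rm a} = \begin{bmatrix} 1_M^\top & N \end{bmatrix}^\top, \qquad
{\rm b} = \begin{bmatrix} 1_N^\top & M \end{bmatrix}^\top
```

For example, with M = 3 and N = 2 this gives a = (1, 1, 1, 2) and b = (1, 1, 3), and both vectors sum to M + N = 5.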

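The augmented assignment

\bar {\rm P}

must then satisfy the corresponding constraints: its rows sum to the expected number of matches of each keypoint and dustbin in A, and its columns to those in B.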
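Written out, with a and b denoting the expected match counts for A and B (1 for every keypoint, N and M for the two dustbins), the constraints from the paper are:

```latex
\bar {\rm P} \, 1_{N+1} = {\rm a}, \qquad
\bar {\rm P}^\top 1_{M+1} = {\rm b}
```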

The assignment is now the solution of an optimal transport problem whose scores are given by

\bar {\rm S}

and this soft assignment problem can be solved in parallel on a GPU with the Sinkhorn algorithm.
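As a rough illustration, a log-domain Sinkhorn iteration alternately rescales rows and columns until the marginals match. This is a CPU NumPy sketch under the assumption of unit entropic regularization, not the released implementation; `sinkhorn_log`, `logsumexp`, and the `iters` parameter are hypothetical names.

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def sinkhorn_log(S_bar, a, b, iters=100):
    """Return P_bar whose row sums approach a and column sums approach b."""
    log_a, log_b = np.log(a), np.log(b)
    u = np.zeros_like(log_a)          # row scaling (log domain)
    v = np.zeros_like(log_b)          # column scaling (log domain)
    for _ in range(iters):
        u = log_a - logsumexp(S_bar + v[None, :], axis=1)
        v = log_b - logsumexp(S_bar + u[:, None], axis=0)
    return np.exp(S_bar + u[:, None] + v[None, :])

M, N = 3, 2
rng = np.random.default_rng(0)
S_bar = rng.uniform(-0.5, 0.5, size=(M + 1, N + 1))   # toy augmented scores
a = np.array([1.0] * M + [float(N)])   # expected matches for A and its dustbin
b = np.array([1.0] * N + [float(M)])   # expected matches for B and its dustbin
P_bar = sinkhorn_log(S_bar, a, b)
P = P_bar[:M, :N]                      # drop the dustbins to get the assignment
```

Every step is a row-wise or column-wise reduction, so the same loop maps directly onto batched GPU tensor operations and stays differentiable.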

Loss

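Both the graph neural network and the optimal matching layer are differentiable, so gradients can be backpropagated from the matches all the way to the visual descriptors.

SuperGlue is trained in a supervised manner from ground-truth matches.

Some keypoints in A and B are additionally labeled as unmatched. Given these labels, the negative log-likelihood of the assignment

\bar {\rm P}

is minimized as the training loss.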
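Using the paper's symbols, with \mathcal{M} the set of ground-truth matches (i, j) and \mathcal{I}, \mathcal{J} the keypoints of A and B labeled as unmatched, the loss reads:

```latex
{\rm Loss} = -\sum_{(i,j) \in \mathcal{M}} \log \bar {\rm P}_{i,j}
             -\sum_{i \in \mathcal{I}} \log \bar {\rm P}_{i,N+1}
             -\sum_{j \in \mathcal{J}} \log \bar {\rm P}_{M+1,j}
```

The first term rewards confident correct matches, and the other two reward sending unmatched keypoints to the dustbin column (index N+1) or dustbin row (index M+1).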

Experiment

All intermediate representations have the same dimension, D=256.

The GNN has L=9 layers, and both self- and cross-attention use multi-head attention with 4 heads.

The number of Sinkhorn iterations, T, is set to 100.

Homography estimation

Images from the Oxford and Paris dataset are warped with random homographies for this experiment.

Because SuperGlue suppresses most outliers and its correspondences are of high quality, DLT, a non-robust least-squares solver, actually scores higher than the robust estimator RANSAC.
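To make the contrast concrete, here is a bare-bones DLT homography solver. It is only a sketch of the general technique, not the paper's evaluation code: it uses every correspondence it is given, has no outlier rejection, and skips the coordinate normalization that practical implementations usually add. The function name `homography_dlt` is a placeholder.

```python
import numpy as np

def homography_dlt(src, dst):
    """Least-squares homography H with dst ~ H @ src, from N >= 4 matches."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows, dtype=float)
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

# Synthetic check: recover a known homography from exact matches.
rng = np.random.default_rng(0)
H_true = np.array([[1.2, 0.1, 0.5],
                   [-0.05, 0.9, -0.3],
                   [0.1, 0.2, 1.0]])
src = rng.random((6, 2))
src_h = np.hstack([src, np.ones((6, 1))])
dst_h = src_h @ H_true.T
dst = dst_h[:, :2] / dst_h[:, 2:]
H_est = homography_dlt(src, dst)
```

Since every match contributes equally to the least-squares fit, a single bad correspondence can distort the result, which is why DLT only works well when the matcher has already removed the outliers.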

Indoor pose estimation

Indoor images are considered very challenging because they lack texture and contain a high degree of self-similarity. The ScanNet dataset is used.

Outdoor pose estimation

Outdoor images pose their own difficulties, such as illumination changes and occlusion. The PhotoTourism dataset is used.

Conclusions
