DEFORMABLE DETR: DEFORMABLE TRANSFORMERS FOR END-TO-END OBJECT DETECTION

Oct 29, 2021



ICLR 2021


Introduction

Detection Transformer (DETR), published at ECCV 2020, tackles object detection with a transformer. Deformable DETR builds on DETR and proposes an architecture that improves detection performance while making inference more efficient.

Related Works

0. Transformer (Self attention mechanism)

Each element of the sequence is turned into a query, a key, and a value, and the attention weights are computed over every query-key combination (n^2 pairs for a sequence of length n).
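To make the n^2 pairing concrete, here is a minimal sketch in plain Python. It assumes scaled dot-product scores and, purely for illustration, uses the raw features as both queries and keys (identity projections). The function names are hypothetical and not taken from any library.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Return an n_q x n_k matrix of softmax-normalized attention weights."""
    d = len(queries[0])
    weights = []
    for q in queries:
        # One score per key: every query is compared with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    return weights

# Toy sequence of n = 3 elements with 2-d features.
# Assumption: identity query/key projections, for illustration only.
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = attention_weights(seq, seq)
num_pairs = sum(len(row) for row in A)  # n * n query-key pairs
```

Each row of `A` sums to 1, and the number of weights grows quadratically with the sequence length, which is the cost that Deformable DETR targets.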


1. DETR

DETR is a method that solves object detection with a transformer.

1) The image is passed through a CNN backbone to extract features.

2) The extracted feature map (H x W x C) is reshaped to (HW x C).

3) The reshaped feature is fed into the transformer encoder.

4) A predefined number of object queries is fed into the decoder, which decodes the ones that correspond to objects.
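Step 2 above can be sketched in plain Python with nested lists. This is only an illustration of the (H x W x C) to (HW x C) reshape; the toy sizes and the helper name `flatten_feature_map` are assumptions, and in a tensor library the same step would be a single reshape call.

```python
def flatten_feature_map(feature):
    """Reshape an (H, W, C) nested list into an (H*W, C) list of tokens, row-major."""
    return [pixel for row in feature for pixel in row]

# Toy backbone output: H = 2, W = 3, C = 4 (hypothetical sizes).
H, W, C = 2, 3, 4
# Every channel of pixel (h, w) holds its row-major index, so the order is easy to check.
feature = [[[float(h * W + w)] * C for w in range(W)] for h in range(H)]

tokens = flatten_feature_map(feature)  # HW tokens, each a C-dimensional vector
```

After this step the encoder sees a sequence of HW tokens, so the sequence length grows with the spatial resolution of the feature map.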


DETR has the following drawbacks.

1) It needs many more training epochs to converge than other object detectors.

2) Its performance on small objects is low.


Proposed model

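The overall architecture of Deformable DETR is the same as DETR. The difference is that multi-head attention is replaced with deformable multi-head attention.

The main idea of the paper is that only a small part of the image pixels is relevant to a given query. When computing attention weights, the model therefore computes them only for a few specific keys (or values) that appear relevant to the query, which lowers the computational cost.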
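A minimal sketch of this idea, under simplifying assumptions: a single head, a single feature scale, no input or output projections, and sampling offsets and attention weights passed in directly (in the paper they are predicted from the query feature). Fractional sampling locations are read with bilinear interpolation. All names are hypothetical.

```python
import math

def bilinear_sample(feature, x, y):
    """Bilinearly interpolate an (H, W, C) nested-list feature map at fractional (x, y).
    Neighbors that fall outside the map contribute zeros."""
    H, W, C = len(feature), len(feature[0]), len(feature[0][0])
    x0, y0 = math.floor(x), math.floor(y)
    out = [0.0] * C
    for yi, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yi < H and 0 <= xi < W:
                for c in range(C):
                    out[c] += wy * wx * feature[yi][xi][c]
    return out

def deformable_attention(feature, ref_point, offsets, weights):
    """Weighted sum of K features sampled around the query's reference point."""
    C = len(feature[0][0])
    out = [0.0] * C
    for (dx, dy), a in zip(offsets, weights):
        sampled = bilinear_sample(feature, ref_point[0] + dx, ref_point[1] + dy)
        out = [o + a * s for o, s in zip(out, sampled)]
    return out

# 4 x 4 map with one channel whose value equals the x coordinate.
feature = [[[float(x)] for x in range(4)] for y in range(4)]
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, 1.0)]  # K = 3 sampling offsets (illustrative)
weights = [0.25, 0.25, 0.5]                       # attention weights, sum to 1
out = deformable_attention(feature, (1.5, 1.0), offsets, weights)
```

The query touches only K = 3 locations instead of all 16 pixels, so the work per query depends on K rather than on the size of the feature map.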


Multi-head attention in Transformer

$MultiHeadAttn(z_q,x)=\sum_{m=1}^{M}{W_m[\sum_{k\in\Omega_k}{A_{mqk}W'_{m}x_k}]}$

$z_q$ : Query representation feature

$x_k$ : key, value representation feature

$A_{mqk}$ : Attention weight

Computational cost for attention weight : $O(n_q n_k)$

( $n_q$ : the number of queries, $n_k$ : the number of keys (or values))

In DETR,

Encoder : $n_q = HW$, $n_k = HW$, computational cost = $O(H^2W^2)$

Decoder : $n_q = N$, $n_k = HW$, computational cost = $O(NHW)$ ( $N$ : the number of object queries)

$DeformableAttn(z_q,p_q,x)=\sum_{m=1}^{M}{W_m[\sum_{k=1}^{K}{A_{mqk}W'_{m}x(p_q+\Delta p_{mqk})}]}$

$z_q$ : Query representation feature

$p_q$ : 2-d reference point

$x$ : key, value representation feature

$\Delta p_{mqk}$ : sampling offset

$K$ : the number of sampled keys per query

In Deformable DETR

Encoder : $n_q = HW$, $n_k = K$, computational cost = $O(HWK)$

Decoder : $n_q = N$, $n_k = K$, computational cost = $O(NK)$
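As a rough worked example of the attention-weight cost, the sketch below counts n_q * n_k weights per head and per layer. It assumes that the DETR encoder attends over all HW pixels, that the decoder uses N object queries against HW keys, and that deformable attention looks at K sampled keys per query. The concrete sizes (a 32 x 32 feature map, N = 100, K = 4) are illustrative choices, not results.

```python
def attention_weight_counts(H, W, N, K):
    """Number of attention weights per head and per layer under the n_q * n_k cost model."""
    hw = H * W
    return {
        "detr_encoder": hw * hw,       # n_q = HW, n_k = HW
        "detr_decoder": N * hw,        # n_q = N,  n_k = HW
        "deformable_encoder": hw * K,  # n_q = HW, n_k = K
        "deformable_decoder": N * K,   # n_q = N,  n_k = K
    }

# Illustrative sizes only (assumptions, not measured values).
counts = attention_weight_counts(H=32, W=32, N=100, K=4)
for name, value in counts.items():
    print(f"{name}: {value}")
```

With these sizes the encoder goes from 1,048,576 weights to 4,096 and the decoder from 102,400 to 400, which shows that the quadratic dependence on HW is what dominates in DETR.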


Because the computation and memory cost are reduced, Deformable DETR can afford to take multi-scale feature maps as input.
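A small sketch of why this matters: flattening several feature levels into one token sequence makes the sequence much longer, which full attention pays for quadratically. The level shapes below are hypothetical, and the sketch assumes each query samples K points on each of the L levels.

```python
def total_tokens(level_shapes):
    """Number of tokens after flattening every (H_l, W_l) level and concatenating them."""
    return sum(h * w for h, w in level_shapes)

# Hypothetical feature pyramid: four levels, each half the resolution of the previous one.
level_shapes = [(64, 64), (32, 32), (16, 16), (8, 8)]
L = len(level_shapes)  # number of feature levels
K = 4                  # sampled points per level and per query (assumption)

n_tokens = total_tokens(level_shapes)
full_attention_weights = n_tokens * n_tokens  # every token attends to every token
deformable_weights = n_tokens * L * K         # every token attends to L * K samples
```

Under these assumptions the encoder has 5,440 tokens, and full attention would need about 29.6 million weights per head, against 87,040 with sampled keys.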


Experimental Results

[Figure: experimental results]
