Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Oct 29, 2021



Introduction

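In this paper, the authors explore how a Transformer can serve as a general-purpose backbone for computer vision. They attribute the difficulty of carrying the Transformer's strong performance in the language domain over to the vision domain to two problems.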

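The first is scale. Unlike natural language, where a word can reasonably be treated as a single token, visual elements can vary substantially in scale. Existing Transformer-based approaches all process image tokens of a fixed size, which makes them a poor fit for the vision domain.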

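The second is resolution. Unlike words in text, images often have to be processed at the pixel level. In particular, tasks that require pixel-level prediction, such as semantic segmentation, are hard to handle with global self-attention, whose computational complexity is quadratic in image size.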

Figure 1. (a) The proposed Swin Transformer builds hierarchical feature maps by merging image patches (shown in gray) in deeper layers and has linear computation complexity to input image size due to computation of self-attention only within each local window (shown in red). It can thus serve as a general-purpose backbone for both image classification and dense recognition tasks. (b) In contrast, previous vision Transformers [19] produce feature maps of a single low resolution and have quadratic computation complexity to input image size due to computation of self-attention globally.

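To overcome these problems, the authors propose Swin Transformer. Swin Transformer starts from small patches (outlined in gray in Figure 1) and gradually merges them in deeper layers. This hierarchical structure makes it possible to apply dense prediction techniques such as FPN or U-Net. In addition, self-attention is computed locally within non-overlapping windows (outlined in red in Figure 1), which yields computational complexity that is linear in image size. This suggests that Swin Transformer can be applied to a variety of vision tasks that earlier approaches could not attempt because of their computational cost.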

Method

Overall Architecture

Figure 3. (a) The architecture of a Swin Transformer (Swin-T); (b) two successive Swin Transformer Blocks. W-MSA and SW-MSA are multi-head self attention modules with regular and shifted windowing configurations, respectively.

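Swin Transformer takes an RGB image as input and splits it into non-overlapping patches (Patch Partition). Each patch is treated as a token whose feature is the concatenation of its raw RGB pixel values. For example, with a patch size of 4 x 4, each token has 4 x 4 x 3 = 48 features. A Linear Embedding layer then projects these features to an arbitrary dimension C. The Swin Transformer Blocks applied afterwards are described below; they preserve both the number of tokens, (H/4) x (W/4), and the feature dimension. Everything up to this point is called Stage 1.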
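The patch partition and linear embedding steps can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `patch_partition`, the 224 x 224 input size, and the random matrix `W_embed` (standing in for learned weights) are assumptions made for the example.

```python
import numpy as np

def patch_partition(image, patch_size=4):
    """Split an (H, W, 3) image into non-overlapping patches and flatten each one."""
    H, W, ch = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    p = patch_size
    # (H/p, p, W/p, p, ch) -> (H/p, W/p, p, p, ch) -> (H/p, W/p, p*p*ch)
    x = image.reshape(H // p, p, W // p, p, ch).transpose(0, 2, 1, 3, 4)
    return x.reshape(H // p, W // p, p * p * ch)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))              # assumed input size

tokens = patch_partition(image)                # one 48-dim token per 4 x 4 patch
C = 96                                         # embedding dimension (Swin-T uses 96)
W_embed = rng.standard_normal((48, C)) * 0.02  # stand-in for learned weights
embedded = tokens @ W_embed                    # linear embedding to C dimensions

print(tokens.shape, embedded.shape)
```

Each token is simply the 48 raw pixel values of its patch, so the only learned part of this step is the projection to C.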

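In Stage 2, a patch merging layer concatenates the features of each group of 2 x 2 neighboring patches, which gives 4C-dimensional features, and a linear layer then projects them to 2C. The output of Stage 2 is therefore (H/8) x (W/8) x 2C. Stages 3 and 4 repeat this procedure, producing (H/16) x (W/16) x 4C and (H/32) x (W/32) x 8C. These feature maps have the same resolutions as the hierarchical features produced by VGG or ResNet, which makes it easy to swap an existing backbone network for Swin Transformer.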
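The following sketch traces the feature map shapes through the patch merging steps. It assumes a 224 x 224 input with C = 96 and uses random matrices in place of learned weights; the function name `patch_merging` is chosen for illustration, and any normalization applied around the projection is left out for brevity.

```python
import numpy as np

def patch_merging(x, weight):
    """Concatenate each 2 x 2 group of neighboring tokens (4C) and project to 2C."""
    H, W, C = x.shape
    assert H % 2 == 0 and W % 2 == 0
    merged = np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]], axis=-1
    )                                  # (H/2, W/2, 4C)
    return merged @ weight             # (H/2, W/2, 2C)

rng = np.random.default_rng(0)
x = rng.random((56, 56, 96))           # Stage 1 output: (H/4) x (W/4) x C
shapes = [x.shape]
for _ in range(3):                     # Stages 2, 3 and 4
    C = x.shape[-1]
    w = rng.standard_normal((4 * C, 2 * C)) * 0.02  # stand-in for learned weights
    x = patch_merging(x, w)
    shapes.append(x.shape)

print(shapes)
```

Each merging step halves the spatial resolution and doubles the channel count, which is the same pattern a convolutional backbone follows between stages.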

Shifted Window based Self-Attention

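For computational efficiency, the authors propose computing self-attention within local windows. Assuming each local window contains M x M patches, the computational complexity of global self-attention (MSA) and window-based self-attention (W-MSA) on an image of h x w patches is:

\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2 C

\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2 hwC

The former is quadratic in the number of patches, growing as (hw)^2, whereas the latter is linear in hw when M is fixed.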
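To make the difference concrete, the two complexity expressions from the paper, 4hwC^2 + 2(hw)^2 C for global self-attention and 4hwC^2 + 2M^2 hwC for window-based self-attention, can be evaluated directly. The sizes below (h = w = 56, C = 96, M = 7) are assumed values corresponding to the first stage of a 224 x 224 input; the function names are illustrative.

```python
def msa_flops(h, w, C):
    """Global self-attention: 4hwC^2 + 2(hw)^2 C."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, M):
    """Window-based self-attention: 4hwC^2 + 2M^2 hwC."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

C, M = 96, 7
for side in (56, 112, 224):
    g = msa_flops(side, side, C)
    l = wmsa_flops(side, side, C, M)
    print(f"{side}x{side}: MSA={g:.3e}  W-MSA={l:.3e}  ratio={g / l:.1f}")
```

Doubling the side length multiplies the number of patches by 4. The W-MSA cost then grows by exactly 4x, while the MSA cost grows by more than that because of the (hw)^2 term.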

Shifted Window Partitioning in Successive Blocks

Window-based attention lacks connections between windows, which can limit the model's performance. To introduce cross-window connections while keeping the efficiency of windowed computation, the authors propose shifted window partitioning.

Figure 2. An illustration of the shifted window approach for computing self-attention in the proposed Swin Transformer architecture. In layer l (left), a regular window partitioning scheme is adopted, and self-attention is computed within each window. In the next layer l + 1 (right), the window partitioning is shifted, resulting in new windows. The self-attention computation in the new windows crosses the boundaries of the previous windows in layer l, providing connections among them.

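The left side of the figure shows the regular partitioning: an 8 x 8 feature map (in patches) is split evenly into 2 x 2 windows of size 4 x 4 (M = 4). The right side shows the partitioning in the next layer, where the windows are displaced by (⌊M/2⌋, ⌊M/2⌋) patches relative to the previous layer (left). Swin Transformer alternates between these two configurations in consecutive blocks, which are computed as:

\hat{z}^{l} = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}

z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l}

\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^{l})) + z^{l}

z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}

Here \hat{z}^{l} and z^{l} denote the outputs of the (S)W-MSA module and the MLP module of block l, respectively.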

Shifted window partitioning creates connections between neighboring non-overlapping windows of the previous layer, and the experiments show that this is effective across a range of tasks.
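The effect of the shift can be checked on the 8 x 8, M = 4 example from Figure 2. The sketch below labels every token with the id of its regular window, then cuts the map along window boundaries displaced by M/2 and lists which regular windows each shifted window touches. The variable names are illustrative, and this is the plain (non-batched) shifted partition, so the windows at the border are smaller than M x M.

```python
import numpy as np

M = 4
H = W = 8

# Id of the regular (layer l) window that each token belongs to.
rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
regular_id = (rows // M) * (W // M) + (cols // M)

# Shifted (layer l + 1) partition: boundaries displaced by M // 2.
shift = M // 2
edges = [0] + list(range(shift, H, M)) + [H]   # [0, 2, 6, 8]
shifted_windows = [
    regular_id[r0:r1, c0:c1]
    for r0, r1 in zip(edges[:-1], edges[1:])
    for c0, c1 in zip(edges[:-1], edges[1:])
]

sizes = [w.shape for w in shifted_windows]
mixes = [sorted(set(w.ravel().tolist())) for w in shifted_windows]
for size, mix in zip(sizes, mixes):
    print(size, "covers regular windows", mix)
```

The central 4 x 4 window covers tokens from all four regular windows, so self-attention inside it exchanges information across the previous layer's window boundaries. The same listing also shows 9 windows instead of 4, several of them smaller than M x M.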

Efficient Batch Computation for Shifted Configuration

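One issue with shifted window partitioning is that it produces more windows, from ⌈h/M⌉ x ⌈w/M⌉ to (⌈h/M⌉ + 1) x (⌈w/M⌉ + 1), and some of them are smaller than M x M. A simple fix is to pad the smaller windows to M x M and mask out the padded values when computing attention, but this is hard to call efficient because the number of windows to process grows. The authors instead propose cyclic shifting, a more efficient way of batching the computation.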

*Figure 4. Illustration of an efficient batch computation approach for self-attention in shifted window partitioning.*

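With this technique, the feature map is cyclically shifted toward the top-left, so a batched window may consist of several sub-windows that are not adjacent in the original feature map. A mask is therefore applied to restrict self-attention to within each sub-window. Because every window stays M x M and the number of batched windows equals that of regular partitioning, the computation remains efficient.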
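A minimal NumPy sketch of the cyclic shift and the resulting attention mask is shown below, again for an 8 x 8 map with M = 4. It is an illustration under assumptions rather than the authors' code: the helper `window_partition`, the way sub-windows are labeled, and the mask value of -100 (any large negative number added to the attention logits would do) are choices made for this example.

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W) map into non-overlapping M x M windows: (num_windows, M*M)."""
    H, W = x.shape
    x = x.reshape(H // M, M, W // M, M).transpose(0, 2, 1, 3)
    return x.reshape(-1, M * M)

M, H, W = 4, 8, 8
shift = M // 2

# Label every token with the id of the shifted window it belongs to
# (boundaries at shift, shift + M, ...), in original coordinates.
row_band = np.digitize(np.arange(H), np.arange(shift, H, M))
col_band = np.digitize(np.arange(W), np.arange(shift, W, M))
num_col_bands = col_band.max() + 1
region = row_band[:, None] * num_col_bands + col_band[None, :]

# Cyclic shift toward the top-left, then reuse the regular partition.
shifted = np.roll(region, shift=(-shift, -shift), axis=(0, 1))
windows = window_partition(shifted, M)            # (4, 16): same count as regular

# Tokens may attend to each other only if they share a sub-window label.
same_region = windows[:, :, None] == windows[:, None, :]
attn_mask = np.where(same_region, 0.0, -100.0)    # added to the attention logits

labels_per_window = [len(np.unique(w)) for w in windows]
restored = np.roll(shifted, shift=(shift, shift), axis=(0, 1))  # reverse shift
print(labels_per_window, np.array_equal(restored, region))
```

Only the top-left batched window consists of a single sub-window; the others combine 2 or 4 sub-windows and rely on the mask. After attention, rolling by the opposite offset puts every token back in its original position.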

Relative Position Bias

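Following prior work, the authors add a relative position bias B to each head when computing attention:

\text{Attention}(Q, K, V) = \text{SoftMax}(QK^T / \sqrt{d} + B) V

where B \in \mathbb{R}^{M^2 \times M^2}. Because the relative position along each axis lies in [-M + 1, M - 1], the authors parameterize a smaller bias matrix

\hat{B} \in \mathbb{R}^{(2M-1) \times (2M-1)}

and take the values of B from \hat{B}.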


# Example of relative positions along one axis (entry [i][j] = j - i)

[[  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13],
 [ -1,   0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12],
 [ -2,  -1,   0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11],
 [ -3,  -2,  -1,   0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10],
 [ -4,  -3,  -2,  -1,   0,   1,   2,   3,   4,   5,   6,   7,   8,   9],
 [ -5,  -4,  -3,  -2,  -1,   0,   1,   2,   3,   4,   5,   6,   7,   8],
 [ -6,  -5,  -4,  -3,  -2,  -1,   0,   1,   2,   3,   4,   5,   6,   7],
 [ -7,  -6,  -5,  -4,  -3,  -2,  -1,   0,   1,   2,   3,   4,   5,   6],
 [ -8,  -7,  -6,  -5,  -4,  -3,  -2,  -1,   0,   1,   2,   3,   4,   5],
 [ -9,  -8,  -7,  -6,  -5,  -4,  -3,  -2,  -1,   0,   1,   2,   3,   4],
 [-10,  -9,  -8,  -7,  -6,  -5,  -4,  -3,  -2,  -1,   0,   1,   2,   3],
 [-11, -10,  -9,  -8,  -7,  -6,  -5,  -4,  -3,  -2,  -1,   0,   1,   2],
 [-12, -11, -10,  -9,  -8,  -7,  -6,  -5,  -4,  -3,  -2,  -1,   0,   1],
 [-13, -12, -11, -10,  -9,  -8,  -7,  -6,  -5,  -4,  -3,  -2,  -1,   0]]
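How B is looked up from \hat{B} can be sketched as follows. The example uses a tiny window (M = 2) and a single head for readability, with random values standing in for the learned \hat{B} and for Q, K and V; the variable names are illustrative assumptions.

```python
import numpy as np

M = 2  # tiny window for readability (the paper's default is 7)

# (row, col) coordinates of the M*M tokens inside one window.
coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"))
coords = coords.reshape(2, -1)                         # (2, M*M)

# Pairwise relative positions along each axis, in [-M + 1, M - 1].
rel = coords[:, :, None] - coords[:, None, :]          # (2, M*M, M*M)
rel = rel + (M - 1)                                    # shift to [0, 2M - 2]
index = rel[0] * (2 * M - 1) + rel[1]                  # flat index into B_hat

rng = np.random.default_rng(0)
B_hat = rng.standard_normal((2 * M - 1, 2 * M - 1))    # stand-in for learned values
B = B_hat.reshape(-1)[index]                           # (M*M, M*M)

# Where the bias enters the attention computation.
d = 32
Q, K, V = (rng.standard_normal((M * M, d)) for _ in range(3))
logits = Q @ K.T / np.sqrt(d) + B
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
out = weights @ V
print(index)
```

Token pairs with the same 2D offset share one entry of \hat{B}, so the number of bias parameters per head is (2M - 1)^2 rather than M^4.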

Architecture Variants

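The authors designate Swin-B, a model with a size and computational cost similar to ViT-B or DeiT-B, as the base model. The other variants are Swin-T (0.25x), Swin-S (0.5x), and Swin-L (2x), where the factors are relative to the model size and computational complexity of Swin-B. Swin-T and Swin-S have complexity similar to ResNet-50 and ResNet-101, respectively. The window size M is set to 7 by default, the query dimension of each head is d = 32, and the expansion ratio of each MLP is α = 4. The resulting hyperparameters are as follows, where C is the channel number of the hidden layers in the first stage:

Swin-T: C = 96, layer numbers = {2, 2, 6, 2}
Swin-S: C = 96, layer numbers = {2, 2, 18, 2}
Swin-B: C = 128, layer numbers = {2, 2, 18, 2}
Swin-L: C = 192, layer numbers = {2, 2, 18, 2}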
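The per-stage widths follow from these settings. The sketch below uses the channel numbers and layer counts reported in the paper (Swin-T: C = 96, {2, 2, 6, 2}; Swin-S: C = 96, {2, 2, 18, 2}; Swin-B: C = 128, {2, 2, 18, 2}; Swin-L: C = 192, {2, 2, 18, 2}) and derives the number of heads by dividing each stage's channel count by the per-head dimension of 32. The dictionary layout and the function name `stage_summary` are assumptions made for the example.

```python
HEAD_DIM = 32    # query dimension of each head
MLP_RATIO = 4    # expansion ratio of each MLP

VARIANTS = {
    "Swin-T": {"C": 96, "depths": (2, 2, 6, 2)},
    "Swin-S": {"C": 96, "depths": (2, 2, 18, 2)},
    "Swin-B": {"C": 128, "depths": (2, 2, 18, 2)},
    "Swin-L": {"C": 192, "depths": (2, 2, 18, 2)},
}

def stage_summary(name):
    """Derive channels, heads and MLP width for each of the four stages."""
    cfg = VARIANTS[name]
    rows = []
    for stage, depth in enumerate(cfg["depths"]):
        channels = cfg["C"] * 2**stage          # doubled by each patch merging
        rows.append({
            "stage": stage + 1,
            "blocks": depth,
            "channels": channels,
            "heads": channels // HEAD_DIM,
            "mlp_hidden": MLP_RATIO * channels,
        })
    return rows

for row in stage_summary("Swin-T"):
    print(row)
```

Because the per-head dimension is fixed, the number of heads doubles with the channel count at every stage.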

Experiments

Image Classification: ImageNet-1K

Object Detection: COCO

*Table 2. Results on COCO object detection and instance segmentation. † denotes that additional deconvolution layers are used to produce hierarchical feature maps. * indicates multi-scale testing.*

Semantic Segmentation: ADE20K

Table 3. Results of semantic segmentation on the ADE20K val and test set. † indicates additional deconvolution layers are used to produce hierarchical feature maps. ‡ indicates that the model is pre-trained on ImageNet-22K.

Ablation Study on the Shifted Windows

Table 4. Ablation study on the shifted windows approach and different position embedding methods on three benchmarks, using the Swin-T architecture. w/o shifting: all self-attention modules adopt regular window partitioning, without shifting; abs. pos.: absolute position embedding term of ViT; rel. pos.: the default settings with an additional relative position bias term; app.: the first scaled dot-product term.

Self-Attention Computation Speed Comparison

*Table 5. Real speed of different self-attention computation methods and implementations on a V100 GPU.*

Self-Attention Method Comparison

*Table 6. Accuracy of Swin Transformer using different methods for self-attention computation on three benchmarks.*