MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

This paper answers the question: “Can the strengths of CNNs and ViTs be combined to build a light-weight, low-latency model for mobile vision tasks?”

Inc Lomin

Jun 28, 2022

On the ImageNet-1k dataset, MobileViT reaches 78.4% top-1 accuracy with about 6M parameters, which is 3.2% and 6.2% higher than MobileNetv3 and DeIT, respectively, at a similar parameter count.

Introduction

Self-attention-based models (the ViT family) have emerged as an alternative to CNNs for vision tasks and now surpass them in accuracy, but they are widely seen as costly in model size and latency.

Many real-world applications, however, require models that run on resource-constrained mobile devices. Until now, the models able to run on mobile devices have all been CNN-based.

At a similar parameter count, current ViT-based models are reported to underperform light-weight CNNs.

For example, DeIT with 5-6M parameters is about 3% less accurate than MobileNetv3.

In addition, because most ViT-based models lack image-specific inductive bias, they need many parameters, are hard to optimize, and require heavy data augmentation and L2 regularization.

The strengths and weaknesses of CNN-based and ViT-based models are as follows.

ㅤ	CNN-based	ViT-based
Pros	Light and fast; easy to train; spatial inductive bias	Long-range dependency; high accuracy
Cons	Local receptive field	Big and slow; hard to train (needs data augmentation and regularization)

The authors set out to build a model for mobile vision tasks that combines the strengths of CNNs and transformers.

They designed MobileViT with a particular focus on being light-weight, general-purpose, and low-latency.

According to the authors, this is the first work to show that a light-weight ViT can reach light-weight-CNN-level performance across different vision tasks with a simple training recipe.

The key properties of MobileViT are summarized below.

Better performance: At a similar parameter count, MobileViT outperforms existing CNN-based models on mobile vision tasks.

Generalization capability: This refers to the gap between training and evaluation metrics. Previous ViT variants (with or without convolutions) generalize worse than CNN-based models even with intensive augmentation, whereas the authors report that MobileViT generalizes better.

Robust: Hyper-parameter tuning costs time and resources, so a good model should be robust to hyper-parameters. Unlike previous ViT variants, MobileViT trains with basic augmentation and is less sensitive to L2 regularization.

Related Works

Light-weight CNNs

MobileNets (Howard et al., 2017; Sandler et al., 2018; Howard et al., 2019), ShuffleNetv2 (Ma et al., 2018), ESPNetv2 (Mehta et al., 2019), MixNet (Tan & Le, 2019b), and MNASNet (Tan et al., 2019).

Vision transformers

ViT (Dosovitskiy et al., 2021), DeIT (Touvron et al., 2021a)

Subsequent works show that the substandard optimizability of ViTs is due to their lack of spatial inductive biases.

e.g., Graham et al., 2021; Dai et al., 2021; Liu et al., 2021; Wang et al., 2021; Yuan et al., 2021b; Chen et al., 2021b

Models that incorporate the strengths of convolution:

ViT-C (Xiao et al., 2021), CvT (Wu et al., 2021), BoTNet (Srinivas et al., 2021), ConViT (d’Ascoli et al., 2021), PiT (Heo et al., 2021)

Most of these are heavy-weight and need more parameters than CNN-based models to reach similar performance.

Proposed Method

Figure 1: Visual transformers vs. MobileViT

MobileViT is built from the following components (Figure 1(b)):

n x n convolution

MobileNetv2 block (MV2)

MobileViT block

MobileViT block

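The goal of this block is to capture both local and global information with fewer parameters.

First, to capture local spatial information, a standard n x n convolution followed by a 1x1 convolution is applied to the H x W x C input, producing

\mathbf{X}_{L} \in \mathbb{R}^{H \times W \times d}

where d > C.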

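Next, to learn global representations, the block runs three steps: unfold, transformer, and fold. The local feature map

\mathbf{X}_{L}

is unfolded into N non-overlapping flattened patches

\mathbf{X}_{U} \in \mathbb{R}^{P \times N \times d}

which are then fed to the transformer.

Here P = wh is the number of pixels in a patch (w, h: width and height of a patch) and N = HW/P is the number of patches (W, H: width and height of the feature map being unfolded).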
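As a quick check of these definitions, the shape arithmetic can be sketched in a few lines of Python. The helper name `unfolded_shape` is made up for illustration, and the sketch assumes H and W are exact multiples of the patch size.

```python
def unfolded_shape(H, W, d, h=2, w=2):
    """Return (P, N, d) for an H x W x d tensor split into h x w patches.

    Hypothetical helper for illustration; assumes H % h == 0 and W % w == 0.
    """
    if H % h != 0 or W % w != 0:
        raise ValueError("H and W must be divisible by the patch size")
    P = h * w           # pixels per patch
    N = (H * W) // P    # number of non-overlapping patches
    return P, N, d


# A 32 x 32 feature map with d = 96 and 2 x 2 patches
print(unfolded_shape(32, 32, 96))  # (4, 256, 96)
```

With 2 x 2 patches, each transformer sequence has a quarter as many tokens as the feature map has pixels.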

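Because MobileViT loses neither the order of the patches nor the spatial order of the pixels within each patch, the transformer output

\mathbf{X}_{G} \in \mathbb{R}^{P \times N \times d}

can be folded back into the original H x W x d shape.

After folding, the block applies a 1x1 convolution that projects the features back to C channels, concatenates the result with the H x W x C input, and applies a final n x n convolution to fuse the two.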
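The unfold and fold steps can be illustrated with a minimal pure-Python sketch on a single-channel grid. This is not the authors' implementation: the function names and the index layout (`x_u[p][n]` holds pixel p of patch n) are assumptions chosen to match the P x N ordering above, and the transformer between the two steps is left out.

```python
def unfold(x, h, w):
    """Rearrange an H x W grid into P x N: x_u[p][n] is pixel p of patch n."""
    H, W = len(x), len(x[0])
    n_cols = W // w                       # patches per row
    P, N = h * w, (H // h) * n_cols
    x_u = [[None] * N for _ in range(P)]
    for i in range(H):
        for j in range(W):
            p = (i % h) * w + (j % w)         # position inside the patch
            n = (i // h) * n_cols + (j // w)  # index of the patch
            x_u[p][n] = x[i][j]
    return x_u


def fold(x_u, H, W, h, w):
    """Inverse of unfold: rebuild the H x W grid from the P x N layout."""
    n_cols = W // w
    x = [[None] * W for _ in range(H)]
    for p, row in enumerate(x_u):
        for n, value in enumerate(row):
            i = (n // n_cols) * h + p // w
            j = (n % n_cols) * w + p % w
            x[i][j] = value
    return x


# 4 x 4 grid with 2 x 2 patches, so P = 4 and N = 4
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
x_u = unfold(grid, 2, 2)
assert fold(x_u, 4, 4, 2, 2) == grid   # the round trip restores the grid
```

Each row `x_u[p]` collects the same pixel position from every patch, which is the sequence a transformer would attend over. Because both index maps are invertible, nothing about the spatial layout is lost.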

Architecture

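MobileViT comes in three sizes: S (small), XS (extra small), and XXS (extra extra small).

The overall structure is shown in Figure 1.

In every MobileViT block, n is set to 3 and h and w are set to 2.

According to the authors, the MV2 blocks account for only a small share of the total parameters and mainly handle down-sampling.

With transformers of L = {2, 4, 3} layers and d = {96, 120, 144} dimensions at the 32x32, 16x16, and 8x8 spatial levels, respectively, MobileViT is 1.85x faster, 2x smaller, and 1.8% more accurate than DeIT (L = 12, d = 192).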
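To get a feel for why the shallower, narrower transformers are cheaper, here is a rough back-of-the-envelope sketch. It counts only the large weight matrices of a standard transformer encoder layer and ignores biases, layer norms, embeddings, and every convolutional layer. The feed-forward ratios (4 for DeIT, 2 for MobileViT) are assumptions, not values taken from this post, so the output is an order-of-magnitude estimate rather than a reproduction of the reported model sizes.

```python
def encoder_params(L, d, ffn_ratio):
    """Approximate weight count of L transformer encoder layers of width d.

    Counts 4*d*d for the Q, K, V, and output projections plus
    2*d*(ffn_ratio*d) for the two feed-forward matrices.
    """
    per_layer = 4 * d * d + 2 * d * (ffn_ratio * d)
    return L * per_layer


# DeIT setting quoted above; ffn_ratio=4 is an assumption.
deit = encoder_params(L=12, d=192, ffn_ratio=4)

# MobileViT setting quoted above; ffn_ratio=2 is an assumption.
mobilevit = sum(
    encoder_params(L, d, ffn_ratio=2)
    for L, d in zip((2, 4, 3), (96, 120, 144))
)

print(f"DeIT encoder weights:      {deit:,}")
print(f"MobileViT encoder weights: {mobilevit:,}")
```

Under these assumptions the transformer part of MobileViT is several times smaller than DeIT's. The overall 2x figure quoted above also includes the convolutional layers, which this estimate leaves out.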

Experiment

Classification performance on ImageNet-1k

Comparison with CNN-based models

Comparison with ViT-based models

Evaluation as a general-purpose backbone

Mobile object detection

Mobile Semantic segmentation

Performance on mobile devices

Models were converted to the CoreML format with CoreMLTools.

Inference time was averaged over 100 runs on an iPhone 12.
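The averaging step can be sketched with a generic timing harness. This is not the authors' benchmark: `average_latency_ms` and `dummy_inference` are hypothetical names, the warm-up count is an assumption, and a real measurement would call the converted model on the device instead of the stand-in workload.

```python
import time


def average_latency_ms(run_once, iterations=100, warmup=10):
    """Average wall-clock time of run_once() in milliseconds."""
    for _ in range(warmup):          # warm-up calls are not timed
        run_once()
    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / iterations


def dummy_inference():
    """Stand-in workload; a real benchmark would run the model here."""
    sum(i * i for i in range(10_000))


print(f"{average_latency_ms(dummy_inference):.3f} ms per run")
```

Averaging over many runs after a warm-up smooths out one-off costs such as caching and lazy initialization.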

The results show that ViT-based models, including MobileViT, are slower than MobileNetv2 on a mobile device. The authors give two reasons. First, dedicated CUDA kernels exist for transformers on GPUs, but there is no equivalent on mobile devices. Second, CNNs benefit from well-developed device-level optimizations, such as fusing batch normalization with convolution. For these reasons MobileViT's speed on mobile devices is sub-optimal, and the authors expect it to improve as device-level operations are optimized.
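To make the batch-norm and convolution fusion concrete, here is a minimal sketch for a single channel of a 1x1 convolution, where the convolution reduces to `weight * x + bias`. At inference time batch normalization is a fixed affine transform, so it can be folded into the preceding convolution's weight and bias. The function names and parameter values are made up for illustration.

```python
import math


def fuse_conv_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm statistics into a per-channel conv weight and bias."""
    scale = gamma / math.sqrt(var + eps)
    return weight * scale, (bias - mean) * scale + beta


def conv_then_bn(x, weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Reference path: convolution followed by inference-time batch norm."""
    y = weight * x + bias
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


# Illustrative values for one channel
params = dict(weight=0.8, bias=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=2.0)
w_fused, b_fused = fuse_conv_bn(**params)

x = 1.7
assert math.isclose(w_fused * x + b_fused, conv_then_bn(x, **params))
```

After fusion the device runs one operation instead of two, which is the kind of device-level optimization that CNNs already enjoy.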

Conclusions

A light-weight model that combines the global attention of transformers with the local inductive bias of CNNs.

Better performance than light-weight CNN-based models.

Real-time inference is possible, but it is considerably slower than CNN-based models (with room for improvement through device-level operation optimization).