ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

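Unless the training objective and the architecture are considered together, it is hard to claim that optimal performance has been reached. The authors therefore argue that the network architecture should change along with the training method.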

Inc Lomin

Jun 22, 2023

Contents

Introduction
Proposed Method and Experiments
Conclusions
References

The ConvNeXt V2 paper can be summarized briefly as follows.

Self-supervised learning (MAE) + an architectural change (GRN) to ConvNeXt V1 that makes self-supervised learning more effective (ConvNeXt V1 was trained with supervised learning)

Introduction

According to the authors, the performance of a visual representation learning system is influenced by three factors.

neural network architecture

ConvNets
Transformers
ConvNeXt(modernized convnets)
…

training method

supervised
self-supervised
…

training data

ImageNet
COCO
…

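A lot has changed on the NN architecture design side, but new architectures are still designed around their performance under supervised learning on the ImageNet dataset. However, unless the training objective and the architecture are considered together, it is hard to claim that optimal performance has been reached. The authors therefore argue that the network architecture should change along with the training method.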

Motivation

Issue : no pretrained models that show self-supervised learning can improve upon the best ConvNeXt supervised results.

The MAE used with vision transformer architectures is not well suited to the ConvNeXt architecture. ConvNets use dense sliding windows, whereas MAE has an encoder-decoder design optimized for the sequence-processing ability of transformers. The authors also note that, empirically, transformers and ConvNets have different feature learning behaviors that affect representation quality.

Contributions

FCMAE (fully convolutional masked autoencoder)
GRN (global response normalization)

Related Works

ConvNets
MAE(masked autoencoders)

Proposed Method and Experiments

FCMAE (fully convolutional masked autoencoder)

FCMAE framework : sparse convolution-based ConvNeXt encoder + lightweight ConvNeXt block decoder - encoder processes only the visible pixels, and the decoder reconstructs the image using the encoded pixels and mask tokens. The loss is calculated only on the masked region.

Masking

masking ratio : 0.6
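Because a conv model has a hierarchical design, the mask is generated at the last stage and then upsampled recursively up to the finest resolution. In practice, 60% of the 32x32 patches are randomly removed from the input image. (The authors also use no augmentation other than random resized cropping.)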
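The masking step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `random_patch_mask`, the 224-pixel input size, and the nearest-neighbor upsampling via `np.repeat` are assumptions made for the example. Only the 0.6 ratio and the 32x32 patch size come from the text above.

```python
import numpy as np

def random_patch_mask(image_size=224, patch_size=32, mask_ratio=0.6, rng=None):
    """Return (coarse_mask, pixel_mask); 1 marks a masked (removed) position."""
    rng = np.random.default_rng() if rng is None else rng
    grid = image_size // patch_size              # patches per side, e.g. 224 // 32 = 7
    num_patches = grid * grid
    num_masked = int(round(mask_ratio * num_patches))

    # Choose which patches to drop, uniformly at random without replacement.
    flat = np.zeros(num_patches, dtype=np.uint8)
    flat[rng.permutation(num_patches)[:num_masked]] = 1
    coarse = flat.reshape(grid, grid)

    # Upsample the coarse mask to pixel resolution by repeating each entry.
    pixel = np.repeat(np.repeat(coarse, patch_size, axis=0), patch_size, axis=1)
    return coarse, pixel

coarse, pixel = random_patch_mask(rng=np.random.default_rng(0))
print(coarse.shape, pixel.shape, int(coarse.sum()))
```

The same coarse mask can be upsampled by a smaller factor to match the feature map of each earlier stage, so every stage of the hierarchy sees a consistent set of visible positions.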

Encoder Design

challenge : for masked image modeling to be effective, the model must be prevented from learning a shortcut that copies and pastes information from the masked region. In transformer-based models this is simple, because only the visible patches need to be fed to the encoder. In ConvNets it is harder, because the 2D image structure has to be preserved.
As one of several ways to address this challenge, the authors view the masked image from a sparse data perspective, which naturally leads to using sparse convolution in the FCMAE framework. (Sparse convolution layers are used during pre-training, and they are converted back to standard convolution layers for fine-tuning.)

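→ (My take) With a standard dense conv layer, the kernel slides over visible and masked positions alike, so information mixes across the mask boundary and the masked region can effectively be copied and pasted. A sparse convolution, which operates only on the visible positions, prevents this. (Note that this is not the same as a depthwise convolution, which restricts mixing across channels rather than across spatial positions.)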

Our empirical findings show that it is essential to prevent information leakage from the masked region in order to achieve good results.
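To see why operating only on visible positions blocks leakage, the behavior can be emulated with a dense convolution that is masked before and after, which as far as I understand is numerically equivalent to a sparse convolution. The sketch below is an illustration under stated assumptions, not the FCMAE implementation: `depthwise_conv2d` and `masked_conv2d` are hypothetical helper names, and the convolution is a plain NumPy depthwise cross-correlation with zero padding.

```python
import numpy as np

def depthwise_conv2d(x, kernel):
    """Dense 'same' depthwise convolution. x: (C, H, W), kernel: (C, k, k), k odd."""
    _, h, w = x.shape
    k = kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros(x.shape, dtype=np.float64)
    for i in range(k):
        for j in range(k):
            out += kernel[:, i, j][:, None, None] * xp[:, i:i + h, j:j + w]
    return out

def masked_conv2d(x, kernel, visible):
    """Zero the masked inputs, convolve, then zero the masked outputs."""
    v = visible[None].astype(x.dtype)        # (1, H, W): 1 = visible, 0 = masked
    return depthwise_conv2d(x * v, kernel) * v

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
kernel = rng.normal(size=(4, 3, 3))
visible = rng.random((8, 8)) > 0.6           # roughly 60% of positions are masked

# Overwrite the masked pixels and check whether the outputs change.
x_changed = x.copy()
x_changed[:, ~visible] = 123.0

same_masked = np.allclose(masked_conv2d(x, kernel, visible),
                          masked_conv2d(x_changed, kernel, visible))
same_dense = np.allclose(depthwise_conv2d(x, kernel),
                         depthwise_conv2d(x_changed, kernel))
print(same_masked, same_dense)
```

The masked version is unaffected by whatever sits in the masked region, while the plain dense convolution is not, which is exactly the leakage the encoder has to avoid.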

Decoder Design

A lightweight, plain ConvNeXt block is used as the decoder (so the overall design is an asymmetric encoder-decoder architecture). The authors experimented with several decoder designs and chose the lightweight ConvNeXt block because it gave good fine-tuning accuracy and significantly reduced pre-training time.

Reconstruction Target

The mean squared error (MSE) between the reconstructed patches and the target patches is used as the loss.
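Combined with the earlier statement that the loss is calculated only on the masked region, the objective can be sketched as a masked MSE. The function name `masked_mse` and the `(N, L, D)` patch layout are assumptions for this example, and it assumes at least one patch is masked.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """MSE averaged over masked patches only.

    pred, target: (N, L, D) arrays with L patches of D values each.
    mask: (N, L) array, 1 = masked patch (contributes to the loss), 0 = visible.
    """
    per_patch = ((pred - target) ** 2).mean(axis=-1)    # (N, L)
    return (per_patch * mask).sum() / mask.sum()

pred = np.zeros((1, 4, 3))
target = np.ones((1, 4, 3))
mask = np.array([[1, 0, 1, 0]], dtype=np.float64)
loss = masked_mse(pred, target, mask)

# Errors on visible patches are ignored by construction.
target_changed = target.copy()
target_changed[0, 1] = 100.0
loss_changed = masked_mse(pred, target_changed, mask)
print(loss, loss_changed)
```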

FCMAE (Experimental Results)

pre-train : ImageNet-1K, 800 epochs
fine-tune : ImageNet-1K, 100 epochs
top-1 IN-1K validation accuracy for a single 224×224 center crop

self-supervised vs supervised learning

FCMAE pre-training followed by 100 epochs of fine-tuning outperformed the randomly initialized baseline trained for 100 epochs, but fell short of the randomly initialized baseline trained for 300 epochs. (The baseline results are taken from the ConvNeXt paper.)

This result differs considerably from recent successes of masked image modeling with transformer-based models, and it prompted the authors to investigate what goes wrong when a ConvNeXt encoder is pre-trained with MAE.

Feature Collapse Phenomenon

Feature activation visualization. We visualize the activation map for each feature channel in small squares. For clarity, we display 64 channels in each visualization. The ConvNeXt V1 model suffers from a feature collapse issue, which is characterized by the presence of redundant activations (dead or saturated neurons) across channels. To fix this problem, we introduce a new method to promote feature diversity during training: the global response normalization (GRN) layer. This technique is applied to high-dimensional features in every block, leading to the development of the ConvNeXt V2 architecture.

During this investigation, the authors visualized the activations of the FCMAE pre-trained ConvNeXt-Base model and discovered a feature collapse phenomenon.

feature collapse phenomenon: there are many dead or saturated feature maps and the activation becomes redundant across channels

Feature cosine distance analysis. As the number of total layers varies for different architectures, we plot the distance values against the normalized layer indexes. We observe that the ConvNeXt V1 FCMAE pre-trained model exhibits severe feature collapse behavior. The supervised model also shows a reduction in feature diversity, but only in the final layers. This decrease in diversity in the supervised model is likely due to the use of the cross-entropy loss, which encourages the model to focus on class-discriminative features while suppressing the others.

A higher distance value means the features are more diverse, and a lower value means there is more feature redundancy.

(For this analysis, 1,000 images are randomly sampled from the ImageNet-1K validation set, high-dimensional features are extracted from each model, the distance is computed per layer for each image, and the distance values are then averaged over all images.)
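A rough sketch of the per-layer distance is given below. It assumes the metric is the average pairwise cosine distance between channels, (1 - cos) / 2 averaged over all C x C channel pairs, which is my reading of the paper rather than something stated in this post. The function name `feature_cosine_distance` and the toy feature shapes are hypothetical.

```python
import numpy as np

def feature_cosine_distance(x, eps=1e-12):
    """Mean pairwise cosine distance between the channels of x, shape (C, H, W)."""
    c = x.shape[0]
    flat = x.reshape(c, -1)                                   # one vector per channel
    unit = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + eps)
    cos = unit @ unit.T                                       # (C, C) cosine similarities
    return ((1.0 - cos) / 2.0).mean()                         # average over all C*C pairs

rng = np.random.default_rng(0)
diverse = rng.normal(size=(64, 14, 14))                       # channels unrelated to each other
collapsed = np.repeat(rng.normal(size=(1, 14, 14)), 64, axis=0)  # every channel identical

d_diverse = feature_cosine_distance(diverse)
d_collapsed = feature_cosine_distance(collapsed)
print(d_diverse, d_collapsed)
```

In the full analysis this value would be computed at every layer and averaged over the sampled images. Redundant channels pull the value toward zero, which is the signature of collapse in the plot.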

This analysis led the authors to look for a way to diversify the features and prevent feature collapse.

GRN (global response normalization)

To increase the contrast and selectivity of each channel, the authors introduce a new response normalization layer called GRN.

A GRN unit consists of 1) global feature aggregation, 2) feature normalization, and 3) feature calibration.

For the feature aggregation in step 1), the L2-norm is used (it gave the best results empirically).

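Next, a response normalization function $\mathcal{N}(\cdot)$ is applied to the aggregated values:

$\mathcal{N}(\|X_i\|) = \frac{\|X_i\|}{\sum_{j=1}^{C} \|X_j\|}$

($\|X_i\|$ is the L2-norm of the i-th channel.)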

This second step computes the relative importance of each channel with respect to all other channels, and it creates feature competition across channels through mutual inhibition.

this step creates a feature competition across channels by mutual inhibition

Finally, the original input responses are calibrated using the computed feature normalization scores.

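To ease optimization, two learnable parameters, $\gamma$ and $\beta$, are added and initialized to zero.

final GRN block : $X_i = \gamma * X_i * \mathcal{N}(\mathcal{G}(X)_i) + \beta + X_i$, where $\mathcal{G}(X)_i = \|X_i\|$ is the aggregated L2-norm of the i-th channel and $\mathcal{N}(\cdot)$ is the response normalization.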

As a result, the block acts as an identity function at initialization and gradually changes over the course of training.
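The three GRN steps can be sketched in NumPy as follows. This is a minimal single-sample illustration, not the authors' implementation: the function name `grn`, the `(H, W, C)` layout, and the `eps` term are assumptions. Implementations may divide by the mean of the norms rather than the sum, which differs only by a constant factor of C.

```python
import numpy as np

def grn(x, gamma, beta, eps=1e-6):
    """Global response normalization for one feature map x of shape (H, W, C)."""
    # 1) Global feature aggregation: L2-norm of each channel over H and W.
    gx = np.sqrt((x ** 2).sum(axis=(0, 1)))       # (C,)
    # 2) Feature normalization: each channel's norm relative to all channels.
    nx = gx / (gx.sum() + eps)                    # (C,)
    # 3) Feature calibration, plus learnable affine terms and a residual connection.
    return gamma * (x * nx) + beta + x

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 7, 16))

# With gamma = beta = 0 (the initialization), the layer is an identity.
y_init = grn(x, gamma=np.zeros(16), beta=np.zeros(16))

# With a non-zero gamma, channels with larger norms are amplified more.
y_trained = grn(x, gamma=np.ones(16), beta=np.zeros(16))
print(np.allclose(y_init, x), y_trained.shape)
```

The residual term is what makes the zero initialization safe: training starts from the unmodified block and only introduces channel competition as $\gamma$ moves away from zero.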

ConvNeXt V2 Block vs ConvNeXt V1 Block

(With GRN added, LayerScale was found empirically to be unnecessary.)

Impact of GRN

With GRN, the FCMAE pre-trained model performs significantly better than its 300-epoch supervised counterpart.

(Results of the ablation over the individual components of GRN.)

(This table shows that changing the learning framework alone has little effect unless the architecture is changed to match it.)

ImageNet Experiments

Transfer Learning Experiments

Conclusions

The model architecture needs to change to match the learning objective.

(Transformer ↔ MAE, ConvNets ↔ FCMAE)
