Exploring Plain Vision Transformer Backbones for Object Detection

Jun 28, 2022



Introduction

Object detection is a fundamental computer vision task, typically performed by detectors comprising a task-agnostic backbone and independently developed necks and heads that incorporate detection-specific prior knowledge. Due to the de-facto design of Convolutional Networks (ConvNets), the most commonly used backbones have been multi-scale, hierarchical architectures.

While recently introduced vision transformers (ViTs) have shown their potential as backbones for visual recognition tasks, the original ViT is a plain, non-hierarchical architecture that maintains a single-scale feature map throughout, making it less effective than ConvNets for object detection tasks, particularly when dealing with multi-scale objects and high-resolution images. As such, we may ask: Is a plain ViT too inefficient for use on high-resolution image detection tasks, and should we instead re-introduce hierarchical designs into the backbone?

The new Meta AI paper Exploring Plain Vision Transformer Backbones for Object Detection makes the case for an effective use of the plain, non-hierarchical ViT as a backbone network for object detection — proposing a design that enables the original ViT to be fine-tuned for object detection without the need to redesign a hierarchical backbone for pretraining. The paper notes this decoupling of pretraining design from fine-tuning demands maintains the independence of upstream vs. downstream tasks, as has been the case for ConvNet-based research.

Related Works

In this section, the paper discusses prior object detectors from three angles:

Object detector backbones: from early single-scale designs to state-of-the-art hierarchical backbones combined with FPN.

Plain-backbone detectors: UViT, which feeds a single-scale feature map to the detection heads, versus the approach proposed here, the simple feature pyramid (Fig. 2-c).

Object detection methodologies: one-stage vs. two-stage, anchor-based vs. anchor-free, region-based vs. query-based, and the axis studied in this paper, plain vs. hierarchical backbones.

Method

Simple Feature Pyramid

The proposed ViTDet builds a simple feature pyramid from only the last feature map of a plain ViT backbone and uses simple non-overlapping window attention to effectively extract features from high-resolution images. A small number of cross-window blocks — which could be global attention or convolutions — are also adopted to propagate information. These adaptations are all made only during fine-tuning and so do not affect pretraining.
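To make the benefit of non-overlapping window attention concrete, here is a minimal pure-Python sketch that partitions a token grid into windows and counts query-key pairs for global versus windowed attention. The 1024×1024 input and patch size of 16 come from the implementation details later in this post. The window size of 16 is an assumption chosen only so that the 64×64 grid divides evenly, and the function names are hypothetical. It is an illustration of the idea, not the paper's implementation.

```python
def window_partition(height, width, window):
    """Group token coordinates of an H x W grid into non-overlapping windows."""
    if height % window or width % window:
        raise ValueError("this sketch assumes the grid divides evenly into windows")
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([(r, c)
                            for r in range(top, top + window)
                            for c in range(left, left + window)])
    return windows


def attention_pairs(height, width, window=None):
    """Count query-key pairs for global (window=None) or windowed attention."""
    tokens = height * width
    if window is None:
        return tokens * tokens
    num_windows = (height // window) * (width // window)
    return num_windows * (window * window) ** 2


grid = 1024 // 16  # 64 tokens per side for a 1024x1024 image with 16x16 patches
windows = window_partition(grid, grid, window=16)  # window=16 is an assumed size
global_pairs = attention_pairs(grid, grid)
windowed_pairs = attention_pairs(grid, grid, window=16)
print(len(windows), global_pairs, windowed_pairs)
```

Under these assumptions, each token attends only to the tokens in its own window, so the number of query-key pairs shrinks by a factor equal to the number of windows.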

An empirical study reveals that ViTDet's simple design achieves surprisingly good results, with the researchers concluding:

It is sufficient to build a simple feature pyramid from a single-scale feature map without the common feature pyramid networks (FPN) design.

It is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks.

If the backbone is non-hierarchical, the foundation of the FPN motivation is lost, as all the feature maps in the backbone have the same resolution. The proposed design therefore uses only the last feature map from the backbone, which should have the strongest features, and applies a set of convolutions or deconvolutions to it in parallel to produce multi-scale feature maps.
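The arithmetic behind the parallel branches can be sketched as follows. This assumes the stride set {1/4, 1/2, 1, 2} applied to the stride-16 ViT output, where a fractional stride stands for a deconvolution (upsampling), which is how I read the paper's default setup. The function name is hypothetical, and the code only computes resulting shapes; it does not implement the convolutions themselves.

```python
from fractions import Fraction


def simple_pyramid_levels(image_size, patch_size, conv_strides):
    """Return (stride relative to the image, feature-map side length) per level."""
    base_side = image_size // patch_size  # side of the single ViT feature map
    levels = []
    for s in conv_strides:
        s = Fraction(s)  # a fractional stride models a deconvolution
        levels.append((int(patch_size * s), int(base_side / s)))
    return levels


# Assumed stride set: two upsampling branches, identity, one downsampling branch.
strides = [Fraction(1, 4), Fraction(1, 2), 1, 2]
levels = simple_pyramid_levels(image_size=1024, patch_size=16, conv_strides=strides)
for out_stride, side in levels:
    print(f"scale 1/{out_stride}: {side} x {side}")
```

Every level is derived directly from the same last feature map, so there are no lateral or top-down connections as in FPN.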

Backbone adaptation

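Because window attention alone never passes information between windows, the paper splits the pre-trained backbone's blocks evenly into four subsets (for example, six blocks per subset for the 24-block ViT-L) and applies one of the following two propagation strategies: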

Global propagation. We perform global self-attention in the last block of each subset. As the number of global blocks is small, the memory and computation cost is feasible. This is similar to the hybrid window attention that was used jointly with FPN.

Convolutional propagation. As an alternative, we add an extra convolutional block after each subset. A convolutional block is a residual block that consists of one or more convolutions and an identity shortcut. The last layer in this block is initialized as zero, such that the initial status of the block is an identity. Initializing a block as identity allows us to insert it into any place in a pre-trained backbone without breaking the initial status of the backbone.
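Both strategies can be illustrated with a small pure-Python sketch. The first function lists which blocks would use global attention, assuming the backbone is split into four equal subsets. The second part shows why a zero-initialized last layer turns a residual block into an identity. For brevity it uses plain linear maps on a small vector instead of convolutions, so it is a simplified stand-in for the convolutional block described above, and all names are hypothetical.

```python
def global_block_indices(depth, num_subsets=4):
    """0-based indices of the last block in each equal-sized subset."""
    if depth % num_subsets != 0:
        raise ValueError("depth must split evenly into subsets")
    per_subset = depth // num_subsets
    return [(i + 1) * per_subset - 1 for i in range(num_subsets)]


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def residual_block(x, w_in, w_out):
    """y = x + w_out @ relu(w_in @ x); linear layers stand in for convolutions."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    update = matvec(w_out, hidden)
    return [v + u for v, u in zip(x, update)]


print(global_block_indices(12))  # a 12-block backbone such as ViT-B
print(global_block_indices(24))  # a 24-block backbone such as ViT-L

x = [0.5, -1.0, 2.0]
w_in = [[0.3, -0.2, 0.1], [0.0, 0.4, -0.5]]  # arbitrary first-layer weights
w_out_zero = [[0.0, 0.0] for _ in x]         # last layer initialized to zero
out = residual_block(x, w_in, w_out_zero)
print(out == x)
```

Since the update branch outputs zeros at initialization, the block can be inserted anywhere in a pre-trained backbone without changing its outputs until fine-tuning moves the last layer away from zero.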

Experimental results

Implementation

Vanilla ViT backbones used:

ViT-B, ViT-L, ViT-H

Patch size = 16

Input image size: 1024×1024

Base frameworks:

Mask R-CNN
Cascade Mask R-CNN

Results

Conclusion

Overall, this work demonstrates that plain-backbone detection has significant potential in object detection tasks. The proposed approach largely maintains the independence of strong general-purpose backbones and downstream task-specific designs, a decoupling of pre-training from fine-tuning that the team hopes may also benefit and consolidate research efforts in the computer vision and natural language processing fields.
