Dynamic Head: Unifying Object Detection Heads with Attentions

Inc Lomin
Dec 02, 2021

Introduction

 
The paper was published by Microsoft researchers (Redmond) in June 2021.
Object detection is one of the core image-understanding tasks tackled with deep learning: literally, finding out what objects are in a given image and where they are located.
Detectors have been developed by many researchers over a long period of time, yet most of them still share a de-facto standard structure that unifies previous work:
This common structure is Backbone + Neck (optional) + Head.
 
notion image
Note: Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
 
The figure above shows the structure of RetinaNet, where the Backbone + Head layout is clearly visible. It is similar to many other single-shot networks such as SSD (Single Shot Detector), YOLO, etc.
Here (a) is the backbone and (c) is the head. The image is processed in the order Backbone -> Head: the backbone extracts image features, and in the head the box and class branches identify where objects are located and what they are.
Part (b) is called the neck. The presence of a neck improves the performance of the model, but it is not essential.
Many studies have shown that the performance of a detector often depends on how well the head is designed.
Then what are the conditions for a good head? The authors say: "The challenges in developing a good object detection head can be summarized into three categories":
  • Scale-aware: the head should detect objects regardless of their size.
  • Spatial-aware: it must recognize the same object no matter how it appears (rotations, deformations, etc.).
  • Task-aware: it should support different representations of objects, such as bounding boxes, center points, corner points, etc.
The authors note that most prior work has focused on solving only one of these three aspects; a head that satisfies all three conditions had not yet been designed.
This open problem is what motivated the authors to create a head that satisfies all three conditions at once: the Dynamic Head.
They treated building such a head, called a unified head in the paper, as an attention learning problem.
Originally they wanted to implement one full self-attention mechanism over the whole feature tensor, but the amount of computation was prohibitive. So instead they apply an attention mechanism separately along each of the scale-aware, spatial-aware, and task-aware dimensions.
The paper groups prior work on scale-awareness, spatial-awareness, and task-awareness, citing the following papers:
  • Scale-awareness:
    • Object detection via region-based fully convolutional networks (R-FCN)
    • Feature pyramid networks for object detection (FPN)
    • Path aggregation network for instance segmentation (FPN with down-up augmentation)
    • Scale-equalizing pyramid convolution for object detection. (3D - convolutions)
  • Spatial-awareness:
    • Object detection in 20 years: A survey (original convolutions)
    • Deep residual learning for image recognition (deeper networks)
    • Multi-scale context aggregation by dilated convolutions
    • Deformable convolutional networks (and its improvements)
  • Task-awareness:
    • Object detection via region-based fully convolutional networks (2-stage)
    • Faster r-cnn: Towards real-time object detection with region proposal networks (RPN)
    • You only look once: Unified, real-time object detection (1-stage)
    • Mask R-CNN (segmentation branch)
    • Fcos: Fully convolutional one-stage object detection
    • Reppoints: Point set representation for object detection
    • Centernet: Keypoint triplets for object detection.
    • Borderdet: Border feature for dense object detection
 

Motivation

The idea proposed in the paper is illustrated in the following figure:
notion image
 
The authors wanted to handle scale-aware, spatial-aware, and task-aware attention within a single, unified object detection head. The figure above shows the structure that achieves this; if you treat the whole sequential process as one block, it becomes a unified object detection head.
The structure roughly works as follows (a small sketch of steps 1-3 follows the list):
  1. Take the feature pyramid from the backbone.
  2. Resize (up- or down-sample) the feature maps of the other levels to the resolution of the median pyramid level, producing a 4D tensor F of size [L x H x W x C].
  3. Flatten the spatial dimensions, S = H x W, so that the 4D tensor F becomes a 3D tensor of size [L x S x C].
  4. Apply scale-aware attention to F, so that for each object in the image the features of the level that characterizes it best are emphasized.
  5. Apply spatial-aware attention to F, so that attention is paid to object features that appear consistently across the spatial locations of all feature-map levels.
  6. Apply task-aware attention, so that individual channels of F serve tasks such as bounding-box regression and center-point prediction.
Here, steps 4 to 6 'combine multiple attentions on all three dimensions': each aspect of attention is performed within a single head.
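As a rough illustration of steps 1-3, here is a minimal PyTorch-style sketch (not from the paper; the level count, channel count, and the use of bilinear interpolation to reach the median level's resolution are assumptions based on the description above):

```python
import torch
import torch.nn.functional as nnf

def build_feature_tensor(pyramid_feats):
    """Resize all pyramid levels to the median level's resolution and
    flatten them into a tensor of shape [L, S, C] (batch dim omitted).

    pyramid_feats: list of L tensors, each of shape [C, H_l, W_l].
    """
    num_levels = len(pyramid_feats)
    # Use the spatial size of the median pyramid level as the common size.
    median_h, median_w = pyramid_feats[num_levels // 2].shape[-2:]

    resized = [
        nnf.interpolate(f.unsqueeze(0), size=(median_h, median_w),
                        mode="bilinear", align_corners=False).squeeze(0)
        for f in pyramid_feats
    ]
    feats_4d = torch.stack(resized, dim=0)                    # [L, C, H, W]
    L, C, H, W = feats_4d.shape
    return feats_4d.permute(0, 2, 3, 1).reshape(L, H * W, C)  # [L, S, C]

# Example: a 4-level pyramid with 256 channels.
pyramid = [torch.randn(256, 2 ** (7 - l), 2 ** (7 - l)) for l in range(4)]
print(build_feature_tensor(pyramid).shape)  # torch.Size([4, 1024, 256])
```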
 

Dynamic Head: Unifying with Attentions

The authors define the general attention operator as:
notion image
Here W(·) is the self-attention operation, π(·) is the attention function, and F is the feature tensor, i.e. W(F) = π(F) · F.
The authors implement the attention functions with simple layers such as 1 x 1 convolutions.
As mentioned earlier, they wanted to process all three aspects (scale, spatial, task) at once, but the computational cost was too high, so they apply the attentions one after another. Written out, this is an implementation of three sequential attentions:
 
notion image
 
where πL(·), πS(·), and πC(·) are the scale-aware, spatial-aware, and task-aware attentions, respectively; the composed operator is W(F) = πC(πS(πL(F) · F) · F) · F.
 

Scale-aware attention

Scale-aware attention is formulated as:
notion image
 
where f(·) is a linear function implemented by a 1 x 1 convolutional layer, and σ(x) = max(0, min(1, (x + 1)/2)) is a hard-sigmoid function. First, the average of F over the space (S) and channel (C) dimensions is computed for each level. This per-level average is then passed through the 1 x 1 convolution (a fully connected operation) and finally through the hard-sigmoid function.
Hard-sigmoid function: a piecewise-linear approximation of the sigmoid with range 0 <= σ(x) <= 1. Because it is cheaper to compute, it slightly speeds up training.
The author says that the reason for using πL(·) is 'to flexibly coalesce features of different scales based on semantic significance'.
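A minimal sketch of what πL might look like in PyTorch (this is our reading of the formula, not the authors' implementation; the 1 x 1 convolution f(·) is modeled as a Linear layer acting on the per-level statistic):

```python
import torch
import torch.nn as nn

def hard_sigmoid(x):
    # The paper's hard sigmoid: max(0, min(1, (x + 1) / 2)).
    return torch.clamp((x + 1) / 2, min=0.0, max=1.0)

class ScaleAwareAttention(nn.Module):
    """Sketch of pi_L acting on a feature tensor of shape [L, S, C]."""

    def __init__(self):
        super().__init__()
        # f(.): a linear map (1x1 convolution) on the per-level statistic.
        self.f = nn.Linear(1, 1)

    def forward(self, feats):                          # feats: [L, S, C]
        stats = feats.mean(dim=(1, 2)).unsqueeze(-1)   # average over S and C -> [L, 1]
        weights = hard_sigmoid(self.f(stats))          # per-level weight in [0, 1]
        return feats * weights.unsqueeze(-1)           # rescale each level
```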
 

Spatial-aware attention

Spatial-aware attention is formulated as:
notion image
 
The formula is more complex than πL:
  • K: the number of sparse sampling locations.
  • pk + ∆pk: a sampling position shifted by the self-learned spatial offset ∆pk, which lets the attention focus on a discriminative region.
  • ∆mk: a self-learned importance scalar at location pk.
Both ∆pk and ∆mk are learned from the features of the median level of F.
 

Deformable convolution:

notion image
 
notion image
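A rough sketch of how πS could be built from the modulated deformable convolution illustrated above, using torchvision's DeformConv2d (assuming a torchvision version that accepts the modulation mask). In the paper, ∆pk and ∆mk are predicted from the median level of F and the result is aggregated across levels; here only the per-level sampling step is shown, and the predictor convolution is our own simplification:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class SpatialAwareAttention(nn.Module):
    """Sketch of pi_S for one pyramid level: a 3x3 modulated deformable
    convolution whose offsets (delta p_k) and importance scalars (delta m_k)
    are predicted from the input features."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size * kernel_size    # K sparse sampling locations
        # Predict 2 offset values (x, y) plus 1 modulation scalar per location.
        self.offset_mask = nn.Conv2d(channels, 3 * self.k, kernel_size,
                                     padding=kernel_size // 2)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size,
                                        padding=kernel_size // 2)

    def forward(self, x):                            # x: [N, C, H, W]
        om = self.offset_mask(x)
        offset = om[:, : 2 * self.k]                 # [N, 2K, H, W]
        mask = torch.sigmoid(om[:, 2 * self.k :])    # importance in [0, 1], [N, K, H, W]
        return self.deform_conv(x, offset, mask)
```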
 

Task-aware attention

Task-aware attention is given by the following formula:
notion image
  • Fc is the feature slice of F corresponding to the c-th channel.
  • [α1, α2, β1, β2]T = θ(·) is a hyper-function that learns to control the activation thresholds. θ(·) is implemented as a sequence of operations: global average pooling over the L x S dimensions -> two fully connected layers -> a normalization layer -> a shifted sigmoid. The shifted sigmoid keeps the output values in the range [-1, 1].
The authors say that the reason for using πC(·) is 'to enable joint learning and to generalize different representations of objects'. πC(·) flexibly switches the appropriate feature channels on and off; after passing through πC(·), the channels of the feature carry the values for each task, such as bounding box, center point, and classification.
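A minimal sketch of πC, which behaves like a dynamic per-channel activation. The hidden width of θ(·) and the omission of its normalization layer are our simplifications:

```python
import torch
import torch.nn as nn

class TaskAwareAttention(nn.Module):
    """Sketch of pi_C: per channel c, output
    max(alpha1 * F_c + beta1, alpha2 * F_c + beta2), where the four
    coefficients are produced by theta(.) from globally pooled features."""

    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        # theta(.): global average pooling -> 2 FC layers -> shifted sigmoid.
        # (The normalization layer used in the paper is omitted here.)
        self.theta = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 4 * channels),
        )

    def forward(self, feats):                   # feats: [L, S, C]
        num_channels = feats.shape[-1]
        pooled = feats.mean(dim=(0, 1))         # global average pooling over L x S -> [C]
        # Shifted sigmoid keeps the coefficients in [-1, 1].
        coeffs = 2 * torch.sigmoid(self.theta(pooled)) - 1
        a1, a2, b1, b2 = coeffs.view(4, num_channels)
        return torch.maximum(a1 * feats + b1, a2 * feats + b2)
```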
 

Unified dynamic head block

Finally, the separate attention modules can be stacked into a unified Dynamic Head block:
notion image
Just as the attention modules are stacked into a Dynamic Head block, Dynamic Head blocks themselves can be stacked in sequence to further improve the performance of the whole detection framework.
notion image
 
In detail, a Dynamic Head block looks like this:
notion image
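Assuming the three sketch modules above, one possible way to chain them into a block and then stack several blocks (the reshaping between the flattened [L, S, C] view and the [L, C, H, W] view needed by the deformable convolution is our own glue code):

```python
import torch.nn as nn

class DyHeadBlock(nn.Module):
    """One Dynamic Head block: pi_L -> pi_S -> pi_C applied in sequence,
    reusing the sketch modules defined above."""

    def __init__(self, channels: int):
        super().__init__()
        self.scale_attn = ScaleAwareAttention()
        self.spatial_attn = SpatialAwareAttention(channels)
        self.task_attn = TaskAwareAttention(channels)

    def forward(self, feats):                                # feats: [L, C, H, W]
        L, C, H, W = feats.shape
        flat = feats.permute(0, 2, 3, 1).reshape(L, H * W, C)
        flat = self.scale_attn(flat)                         # pi_L on [L, S, C]
        feats = flat.reshape(L, H, W, C).permute(0, 3, 1, 2)
        feats = self.spatial_attn(feats)                     # pi_S, levels as a batch
        flat = feats.permute(0, 2, 3, 1).reshape(L, H * W, C)
        flat = self.task_attn(flat)                          # pi_C on [L, S, C]
        return flat.reshape(L, H, W, C).permute(0, 3, 1, 2)

# Stacking several blocks, as in the paper's study on head depth.
dyhead = nn.Sequential(*[DyHeadBlock(channels=256) for _ in range(6)])
```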
 

Generalizing to Existing Detectors

The paper also gives a guide on how to apply the Dynamic Head to existing detector frameworks, with a different recipe for one-stage and two-stage detectors.
One-stage detector: the recipe is simple. Remove the existing head and connect one unified branch made of Dynamic Head blocks to the backbone. Because the attention block can be applied multiple times, it is drawn as one unified branch rather than a single attention block.
notion image
The authors note that, compared to existing one-stage detectors, this is very simple and performance also improves.
Moreover, because the Dynamic Head is flexible with respect to how objects are represented (the task), it can be plugged into models that use various object representations; a hypothetical wiring sketch follows below.
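As an illustration of the one-stage recipe, here is a hypothetical wiring (names like `backbone_with_fpn`, `cls_pred`, and `box_pred` are ours, and we assume the pyramid features are already resized to a common spatial size before stacking):

```python
import torch
import torch.nn as nn

class OneStageDetectorWithDyHead(nn.Module):
    """Hypothetical one-stage wiring: the original head is removed and one
    unified branch of Dynamic Head blocks sits between the backbone+FPN and
    lightweight task-specific predictors."""

    def __init__(self, backbone_with_fpn, channels=256, num_classes=80, num_blocks=6):
        super().__init__()
        self.backbone = backbone_with_fpn   # assumed to return L maps of shape [C, H, W]
        self.dyhead = nn.Sequential(*[DyHeadBlock(channels) for _ in range(num_blocks)])
        self.cls_pred = nn.Conv2d(channels, num_classes, 3, padding=1)
        self.box_pred = nn.Conv2d(channels, 4, 3, padding=1)

    def forward(self, images):
        feats = torch.stack(self.backbone(images))   # [L, C, H, W]
        feats = self.dyhead(feats)
        return self.cls_pred(feats), self.box_pred(feats)
```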
Two-stage detector: unlike the one-stage case, which supports various object representations, the two-stage setup only uses bounding boxes (the paper does not state why).
When the Dynamic Head is applied to a two-stage detector, scale-aware and spatial-aware attention are applied to the feature pyramid before RoI pooling, and task-aware attention is applied after RoI pooling, where it replaces the original fully connected layers of the head.
notion image
 

Experiment results

The dataset used for evaluation is MS-COCO.
An ablation study of the separate attention modules was carried out:
notion image
 

Scale-aware attention

notion image
The figure above shows the 'trend of the learned scale ratios', i.e. the distribution of the learned scale ratio for the feature maps at each level.

Spatial-aware attention

notion image
 
The figure above visualizes the attention after each stage. Before any attention (the backbone output), the features are very noisy and do not focus on the objects.
As the features pass through more attention blocks, however, attention concentrates more and more on the objects in the image. 'Block 2' shows the attention after passing through [πL(·) -> πS(·) -> πC(·)] twice, 'Block 4' after four passes, and 'Block 6' after six.
The authors argue that this visualization demonstrates the effectiveness of spatial-aware attention learning.
 

Efficiency on the Depth of Head

A study of the optimal number of Dynamic Head blocks and the associated trade-offs:
notion image
 

Generalization on Existing Object Detectors

Experiments on existing frameworks show a solid improvement in performance: 1.2%~3.2% AP on average.
notion image
 

Cooperate with Different Backbones

notion image
 
notion image
 

Conclusion

The authors implemented a head that combines scale-aware, spatial-aware, and task-aware attention in one framework. It can be applied to various detectors, and performance improves every time it is applied.
The authors also note that using an attention mechanism for head design and training, as the Dynamic Head does, is an interesting direction that deserves more attention.
Finally, the paper presents two items that could be developed further when attention is used in the head:
  • Making a full-attention model (one that attends to all aspects jointly) easy to train and computationally efficient.
  • Systematically incorporating more aspects of attention into the head to improve performance.
 