End-to-End Object Detection with Fully Convolutional Network

Dec 02, 2021

Contents

Introduction, Proposed Method, Experiment, Results, Conclusions

Introduction

Many mainstream detectors rely on hand-crafted designs such as anchor-based label assignment and non-maximum suppression (NMS).

These designs already perform well, but there have been attempts to drop NMS so that the detector can be trained fully end-to-end (Learnable NMS, Soft NMS and other NMS variants, CenterNet, DETR, etc.). However, the NMS variants and CenterNet propose effective duplicate removal without providing an end-to-end training scheme, and DETR suffers from long training time and low accuracy on small objects.

This paper proposes prediction-aware one-to-one (POTO) label assignment and 3D Max Filtering (3DMF). It also shows that adding an auxiliary loss lets the model surpass the baseline.

Proposed Method

Prediction-aware One-to-One (POTO) Label Assignment

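Here $c_i$ and $b_i$ denote the category label and the bounding box coordinates of the i-th ground truth, respectively.

The goal of POTO is to find a suitable permutation $\hat{\pi}$ of predictions to serve as foreground samples. G and N are the number of ground truths and the number of predictions, respectively (G << N).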

Prior work (DETR) treats this as a bipartite matching problem and solves it with the Hungarian algorithm, using the foreground loss as the matching cost.

However, the foreground loss needs additional weights to cope with optimization issues, namely unbalanced training samples and joint training of multiple tasks. Table 1 shows that this approach is not optimal.

The authors therefore propose a more effective formulation for better assignment.

Here $Q_{i,\pi(i)}$ denotes the matching quality between the i-th ground truth and the $\pi(i)$-th prediction, and $\Omega_i$ denotes the candidate predictions for the i-th ground truth (i.e., the spatial prior).
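For reference, the assignment objective and the matching quality can be written as below. This is a reconstruction from the definitions given in this section (spatial prior, classification score, IoU, and the exponent $\alpha$), so the exact notation, in particular $\hat{p}$ for predicted scores and $\hat{b}$ for predicted boxes, should be read as an assumption rather than a verbatim copy of the paper's equations.

```latex
\hat{\pi} = \arg\max_{\pi \in \Pi_G^N} \sum_{i=1}^{G} Q_{i,\pi(i)},
\qquad
Q_{i,\pi(i)} =
  \mathbb{1}\left[\pi(i) \in \Omega_i\right]
  \cdot \left(\hat{p}_{\pi(i)}(c_i)\right)^{1-\alpha}
  \cdot \left(\mathrm{IoU}\left(b_i, \hat{b}_{\pi(i)}\right)\right)^{\alpha}
```

With a large $\alpha$ the IoU term dominates, so a well-localized prediction can win the assignment over a slightly more confident but worse-localized one.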

The hyper-parameter $\alpha$ is set to 0.8.
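A minimal sketch of quality-based one-to-one assignment, assuming the matching quality is the product of a spatial-prior indicator, the classification score raised to (1 - alpha), and the IoU raised to alpha, with alpha = 0.8 as stated above. The helper names (`iou`, `matching_quality`, `poto_assign`) and the brute-force search are illustrative choices, not the authors' implementation; a practical version would use a bipartite matching solver such as the Hungarian algorithm.

```python
from itertools import permutations


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def matching_quality(gt_boxes, gt_labels, pred_boxes, pred_scores, candidates, alpha=0.8):
    """Build Q[i][j] for every ground truth i and prediction j.

    candidates[i] is the set of prediction indices allowed for ground truth i
    (the spatial prior); predictions outside it get quality 0.
    """
    quality = []
    for i, (box, label) in enumerate(zip(gt_boxes, gt_labels)):
        row = []
        for j, (pbox, pscore) in enumerate(zip(pred_boxes, pred_scores)):
            if j not in candidates[i]:
                row.append(0.0)
            else:
                row.append(pscore[label] ** (1 - alpha) * iou(box, pbox) ** alpha)
        quality.append(row)
    return quality


def poto_assign(quality):
    """Pick one distinct prediction per ground truth, maximizing the summed quality.

    Brute force over all assignments, so only suitable for tiny examples.
    """
    num_gt = len(quality)
    num_pred = len(quality[0]) if quality else 0
    best, best_total = None, -1.0
    for perm in permutations(range(num_pred), num_gt):
        total = sum(quality[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best, best_total = perm, total
    return list(best) if best is not None else []


# Toy example: two ground truths, three predictions, two classes.
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
gt_labels = [0, 1]
pred_boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
pred_scores = [[0.6, 0.1], [0.9, 0.1], [0.2, 0.8]]
candidates = [{0, 1}, {2}]

q = matching_quality(gt_boxes, gt_labels, pred_boxes, pred_scores, candidates)
assignment = poto_assign(q)  # assignment[i] is the prediction index for ground truth i
```

In this toy case the first ground truth is matched to prediction 0, which has the lower score but the better box, because the IoU term carries most of the weight when alpha is 0.8.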

Table 1 shows that using POTO narrows the performance gap between inference with and without NMS.

3D Max Filtering

To examine how duplicate predictions are distributed, the authors measured how restricting NMS by scale and by spatial range affects accuracy, as shown in Table 2.

Table 2 shows that applying NMS separately at each scale lowers mAP by 1.7. Restricting the spatial range also lowers mAP substantially.
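To make the per-scale comparison concrete, here is a small sketch of greedy NMS applied over all predictions versus independently within each FPN level. The function names, the `levels` list, and the 0.6 IoU threshold are assumptions for illustration; the exact experimental setup is not given here.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.6):
    """Greedy NMS. Returns indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    for k in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[k], boxes[j]) <= iou_thr for j in keep):
            keep.append(k)
    return keep


def nms_per_level(boxes, scores, levels, iou_thr=0.6):
    """Run NMS separately inside each pyramid level.

    Duplicates that sit on different levels are never compared,
    so they all survive.
    """
    keep = []
    for level in sorted(set(levels)):
        idx = [k for k, lvl in enumerate(levels) if lvl == level]
        kept = nms([boxes[k] for k in idx], [scores[k] for k in idx], iou_thr)
        keep.extend(idx[k] for k in kept)
    return sorted(keep, key=lambda k: scores[k], reverse=True)


# Boxes 0 and 1 cover the same object but come from different levels.
boxes = [(0, 0, 10, 10), (0, 0, 10, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
levels = [3, 4, 3]

global_keep = nms(boxes, scores)                    # duplicate on level 4 is removed
per_level_keep = nms_per_level(boxes, scores, levels)  # duplicate on level 4 survives
```

The surviving cross-level duplicate in the second call is the kind of false positive that would explain an mAP drop when NMS is restricted to a single scale.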

Prior work (e.g., CenterNet, CornerNet) used a max filter as a new post-processing step for duplicate removal, but that filter is not trainable and operates on a single scale only, so it is not well suited to FPN-based architectures.

The authors therefore propose 3D Max Filtering, a multi-scale version of the max filter.
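The idea of a multi-scale max filter can be sketched as follows. This assumes the score maps of neighbouring scales have already been resized to a common resolution (the resizing step is omitted), and the parameter names `tau` (scale range) and `phi` (spatial kernel size) with their defaults are assumptions for illustration. The hard mask in `keep_local_maxima` is a simplification; it stands in for whatever the model does with the filtered map and is not the authors' module.

```python
def max_filter_3d(score_maps, s, tau=2, phi=3):
    """Max over a phi x phi spatial window and scales s - tau//2 .. s + tau//2.

    score_maps is a list of 2D lists (one per scale), all of the same size.
    """
    height, width = len(score_maps[s]), len(score_maps[s][0])
    r = phi // 2
    lo = max(0, s - tau // 2)
    hi = min(len(score_maps) - 1, s + tau // 2)
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            out[y][x] = max(
                score_maps[k][yy][xx]
                for k in range(lo, hi + 1)
                for yy in range(max(0, y - r), min(height, y + r + 1))
                for xx in range(max(0, x - r), min(width, x + r + 1))
            )
    return out


def keep_local_maxima(score_maps, s, tau=2, phi=3):
    """Zero every score at scale s that is not the max of its 3D neighbourhood."""
    filtered = max_filter_3d(score_maps, s, tau, phi)
    return [
        [v if v == m else 0.0 for v, m in zip(row, mrow)]
        for row, mrow in zip(score_maps[s], filtered)
    ]


# Three scales of 2x2 score maps. The strongest response is on scale 2.
maps = [
    [[0.1, 0.2], [0.3, 0.4]],
    [[0.9, 0.1], [0.1, 0.1]],
    [[0.2, 0.2], [0.2, 0.95]],
]

single_scale = keep_local_maxima(maps, 1, tau=0)  # 0.9 survives on scale 1
multi_scale = keep_local_maxima(maps, 1, tau=2)   # 0.9 is suppressed by 0.95 on scale 2
```

The single-scale filter keeps the 0.9 peak because it never sees the stronger response on the adjacent scale, which is the limitation the multi-scale version addresses.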

Auxiliary Loss

Table 1 shows that the POTO+3DMF model still falls short of the FCOS baseline.

The authors attribute this to the fact that one-to-one label assignment provides less supervision, which makes it hard to learn strong and robust feature representations.

They therefore add an auxiliary loss based on one-to-many label assignment.
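The one-to-many rule itself is not spelled out here. As a sketch only, the code below assumes an ATSS-style rule: take the top-k predictions by matching quality as candidates, then keep those whose quality reaches the mean plus the standard deviation of the candidates. The function name, `top_k=9`, and the flat list of qualities (rather than per-pyramid-level candidates) are all assumptions.

```python
from statistics import mean, pstdev


def one_to_many_assign(qualities, top_k=9):
    """Return indices of predictions treated as foreground for one ground truth.

    qualities[j] is the matching quality of prediction j for this ground truth.
    """
    ranked = sorted(range(len(qualities)), key=lambda j: qualities[j], reverse=True)
    # Candidates: the top-k predictions that have non-zero quality.
    candidates = [j for j in ranked[:top_k] if qualities[j] > 0]
    if not candidates:
        return []
    values = [qualities[j] for j in candidates]
    # Statistical threshold (assumed): mean + standard deviation of candidates.
    threshold = mean(values) + pstdev(values)
    return [j for j in candidates if qualities[j] >= threshold]


# Three strong candidates and several weak ones for a single ground truth.
qualities = [0.9, 0.9, 0.9, 0.2, 0.1, 0.1, 0.1, 0.1]
foreground = one_to_many_assign(qualities)  # several positives instead of one
```

Where one-to-one assignment would give this ground truth a single positive, the rule above yields several, which is the extra supervision the auxiliary loss is meant to provide.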

Experiment

Uses the FCOS framework.

As in FCOS, the classification and regression heads each consist of 4 convolution layers. (The centerness branch is removed.)

All backbones are pre-trained on ImageNet (with frozen batch normalization).

During training, images are resized so that the shorter side is 800.

Unless otherwise noted, hyper-parameters follow the 2x schedule in Detectron2.
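The shorter-side resize can be illustrated with a few lines of arithmetic. The function name and the optional `max_size` cap on the longer side are hypothetical additions for illustration; only the target of 800 for the shorter side comes from the setup above.

```python
def resize_shorter_side(width, height, target=800, max_size=None):
    """Return (new_width, new_height) with the shorter side scaled to target.

    If max_size is given and the longer side would exceed it, the scale is
    reduced so that the longer side equals max_size instead.
    """
    scale = target / min(width, height)
    if max_size is not None and scale * max(width, height) > max_size:
        scale = max_size / max(width, height)
    return round(width * scale), round(height * scale)


# A 1280x720 image: the 720 side is scaled up to 800.
new_size = resize_shorter_side(1280, 720)
```

The aspect ratio is preserved in both cases; the cap only changes which side determines the scale.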

Results

Visualization

Compared with the FCOS baseline with one-to-many assignment (a), the proposed method (d) suppresses the scores of duplicate samples.

Prediction-Aware One-to-One Label Assignment

3D Max Filtering

Performance w.r.t. training duration

Evaluation on CrowdHuman

The CrowdHuman dataset contains more complex and crowded scenes than COCO, which makes it well suited to evaluating duplicate removal. The proposed method shows an advantage on CrowdHuman.