YOLO Detector Family - Overview and History

Oct 11, 2022

Introduction

Main challenges of object detection

Crowded or Cluttered Scenes: Too many objects in an image make it extremely crowded. This creates several challenges for the object detection model: occlusions can be large, objects can be small, and the scale can be inconsistent.
notion image
Intra-Class Variance: Another major challenge for object detection is to correctly detect objects of the same class, which can have high variance.
In the example there are six breeds of dogs, and all of them have different sizes, colors, fur lengths, ear shapes, etc., so detecting these objects of the same class can be challenging.
notion image
Class Imbalance: Class imbalance affects almost all modalities, be it image, text, or time series; in the image domain, image classification struggles with it, and object detection is no exception. In object detection it takes the form of a foreground-background class imbalance. To understand why, consider an image containing very few primary objects, with the rest of the image filled with background. The model then looks at many regions of the image (dataset) where most regions are negatives. These negatives contribute little useful information and can overwhelm the training of the model.
notion image
Many other challenges are associated with object detection like:
  • Occlusion
  • Deformation
  • Viewpoint variation
  • Illumination conditions
  • Speed for real-time detection (required in many industrial applications)

Overview of Object Detection History

In the history of object detection, there have been two distinct eras:
  1. Traditional computer vision approaches dominated the field until 2010.
  2. From 2012, a new era of convolutional neural networks began when AlexNet won the ImageNet challenge.
 
notion image
 

Single vs Two Stage object detectors

Single-Stage Object Detectors are a class of object detection architectures that perform detection in one stage, treating it as a simple regression problem: the input image is fed to the network, which directly outputs class probabilities and bounding box coordinates.
These models skip the region proposal stage (the Region Proposal Network of Two-Stage Object Detectors), which generates areas of the image that could contain an object.
notion image
 

YOLOv1

Paper link
 
Main ideas
YOLO (you only look once) was a breakthrough in the object detection field as it was the first single-stage object detector approach that treated detection as a regression problem. The detection architecture only looked once at the image to predict the location of the objects and their class labels.
Unlike the two-stage detector approach (Fast R-CNN, Faster R-CNN), YOLOv1 has no separate proposal generation and refinement stages; it uses a single neural network that predicts class probabilities and bounding box coordinates from the entire image in one pass. Since the detection pipeline is essentially one network, it can be optimized end-to-end, much like an image classification network.
Since the network is designed to train in an end-to-end fashion similar to image classification, the architecture is extremely fast, and the base YOLO model predicts images at 45 FPS (Frames Per Second) benchmarked on a Titan X GPU. The authors also came up with a much lighter version of YOLO called Fast YOLO, having fewer layers that process images at 155 FPS.
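To make the single-pass regression idea concrete, here is a minimal sketch (assuming the original paper's settings: a 7x7 grid, B = 2 boxes per cell, and 20 PASCAL VOC classes; the layer sizes are simplified and the feature tensor is a dummy) of how a fully connected detection head maps backbone features to an S x S x (B*5 + C) output tensor:

import torch

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC)

# Stand-in for the flattened backbone features and the fully connected detection head of YOLOv1.
head = torch.nn.Sequential(
    torch.nn.Linear(1024 * 7 * 7, 4096),
    torch.nn.LeakyReLU(0.1),
    torch.nn.Linear(4096, S * S * (B * 5 + C)),
)

features = torch.randn(1, 1024 * 7 * 7)            # dummy backbone output, for illustration only
out = head(features).view(1, S, S, B * 5 + C)
# Every grid cell predicts B boxes of (x, y, w, h, confidence) plus C class probabilities.
print(out.shape)  # torch.Size([1, 7, 7, 30])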
notion image
notion image
 
Results
notion image
notion image
 

YOLOv2 and YOLO9000

Paper link
Main ideas
Darknet-19
  • It was mainly inspired by prior work; similar to VGG-16, it uses 3x3 filters and doubles the number of channels after every pooling step, with a total of five pooling layers.
  • Instead of fully connected layers, it uses global average pooling to make predictions and 1x1 filters to compress the feature representation between the 3x3 convolutions.
  • The result is a fully convolutional model with 19 convolutional layers and five max-pooling layers.
notion image
Batch Normalization
  • Adding batch normalization to all of the convolutional layers in YOLO improved mAP by 2%.
  • It helped the network training converge and eliminated the need for other regularization techniques like dropout, without the network overfitting the training data.
High Resolution Classifier
  • In YOLOv1, classification pretraining was performed on the ImageNet dataset at an input resolution of 224x224, which was later upscaled to 448x448 for object detection. Because of this, the network had to simultaneously switch to learning object detection and adjust to the new input resolution, making it hard for the network weights to adapt to the new resolution while learning the detection task.
  • In YOLOv2, the authors still perform the classification pretraining at 224x224, but then fine-tune the classification network at the upscaled 448x448 resolution for ten epochs on the same ImageNet data.
  • Finally, they fine-tune the network for the detection task; this high-resolution classifier approach increased mAP by close to 4%. And trust me, a gain of 4% in mAP is a considerable boost.
Convolutional With Anchor Boxes
  • YOLOv1 was an anchor-free model that predicted the coordinates of B-boxes directly using fully connected layers in each grid cell.
  • Inspired by Faster-RCNN that predicts B-boxes using hand-picked priors known as anchor boxes, YOLOv2 also works on the same principle.
  • YOLOv2 removes the fully connected layers and uses anchor boxes to predict bounding boxes. Hence, making it fully convolutional.
  • In YOLOv1, the output feature map was 7x7. In YOLOv2, where the network downsamples the image by a factor of 32, the authors chose a 13x13 output. There are mainly two reasons for this output size:
    • allowing more objects to get detected per image
    • an odd number of locations will have only a single center cell that will help capture large objects that tend to occupy the center of the image
  • To achieve an output size of 13x13, the input resolution is changed to 416x416 from 448x448, and one max-pooling layer is eliminated to produce a higher resolution output feature map.
  • Unlike YOLOv1, where the model predicted only one set of class probabilities per grid cell regardless of the number of boxes B, YOLOv2 predicts class probabilities and objectness for every anchor box.
  • Anchor boxes slightly decrease mAP, from 69.5 to 69.2, but increase recall from 81% to 88%, meaning the model has more room to improve.
  • YOLOv1 predicted 98 boxes per image, but YOLOv2 with anchor boxes can predict 845 boxes (13x13x5) per image, and even more than a thousand with a larger grid.
Dimension Clusters
  • Unlike Faster-RCNN, which used hand-picked anchor boxes, YOLOv2 used a smart technique to find anchor boxes for the PASCAL VOC and MS COCO datasets.
  • The authors reasoned that instead of using hand-picked anchor boxes, it is smarter to pick better priors that reflect the data more closely. That gives the network a better starting point, making it easier to predict the detections and optimize faster.
  • They run k-means clustering on the bounding boxes of the training set to find good anchor boxes, or priors; a minimal sketch is shown after this list.
  • Standard k-means uses Euclidean distance as the metric, but that would generate more error for large boxes than for small ones, so YOLOv2 instead uses d(box, centroid) = 1 − IoU(box, centroid) as the distance.
  • Experiments showed that K=5 is a good trade-off between model complexity and high recall; model complexity increases with the number of anchors.
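Below is a minimal NumPy sketch of this IoU-based k-means on (width, height) pairs; the function names and the random toy data are illustrative, not taken from the paper's code.

import numpy as np

def iou_wh(boxes, centroids):
    """IoU between boxes and centroids when both are centered at the origin (only w, h matter)."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k=5, iters=100):
    centroids = boxes_wh[np.random.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        # distance = 1 - IoU, as in the YOLOv2 paper
        assignment = np.argmin(1.0 - iou_wh(boxes_wh, centroids), axis=1)
        for j in range(k):
            if np.any(assignment == j):
                centroids[j] = boxes_wh[assignment == j].mean(axis=0)
    return centroids

# toy example: 1000 random (w, h) pairs normalized to [0, 1]
anchors = kmeans_anchors(np.random.rand(1000, 2), k=5)
print(anchors)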
Direct location prediction
  • In YOLOv1, we directly predicted the center (x,y) locations for the bounding box, which caused model instability, especially during the early iterations. Furthermore, since in YOLOv1, there was no concept of priors, directly predicting box locations led to a more significant loss as the model had no idea about the objects in the dataset.
  • In YOLOv2, even with the concept of anchors, the authors still follow the YOLOv1 approach of predicting location coordinates relative to the grid cell; the model outputs offsets, which keeps the predictions bounded and stable (see the parameterization below).
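For reference, the YOLOv2 paper parameterizes each predicted box from the raw network outputs (t_x, t_y, t_w, t_h), the grid cell's top-left offset (c_x, c_y), and the anchor (prior) dimensions (p_w, p_h), with σ the sigmoid function:

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)

Because σ bounds the offsets to [0, 1], the predicted center stays inside its grid cell, which is what stabilizes the early iterations of training.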
Fine-grained features
  • YOLOv2 predicts detections over a 13x13 feature map, which works well for large objects, but detecting smaller objects can benefit from fine-grained features. Fine-grained features refer to feature maps from the earlier layers of the network.
  • While detectors like Faster R-CNN and SSD (Single-Shot Detector) make predictions from feature maps at various layers of the network to cover multiple resolutions, YOLOv2 adds a passthrough layer.
  • The passthrough layer was partly inspired by the U-Net paper in which skip-connections were used to concatenate features between the encoder and decoder layers.
  • Similarly, YOLOv2 concatenates the high-resolution features with the low-resolution ones by stacking adjacent features into different channels. This could also be thought of as identity mappings in ResNet architecture.
notion image
  • Since the spatial dimensions of the high-resolution feature map do not match those of the low-resolution map, the 26x26x512 high-resolution map is reorganized into 13x13x2048, which is then concatenated with the original 13x13x1024 features.
  • This concatenation expands the feature map to 13x13x3072, providing access to fine-grained features; a minimal sketch of this reorganization is shown after this list.
  • The use of fine-grained features helped improve the YOLOv2 model by 1%.
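A minimal sketch of this passthrough (space-to-depth) reorganization, assuming a PyTorch tensor in NCHW layout; the exact pixel ordering may differ from the Darknet implementation.

import torch

def passthrough(x, stride=2):
    """Stack each stride x stride block of spatial locations into the channel dimension."""
    n, c, h, w = x.shape
    x = x.view(n, c, h // stride, stride, w // stride, stride)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(n, c * stride * stride, h // stride, w // stride)

high_res = torch.randn(1, 512, 26, 26)
low_res = torch.randn(1, 1024, 13, 13)
fused = torch.cat([passthrough(high_res), low_res], dim=1)
print(fused.shape)  # torch.Size([1, 3072, 13, 13])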
Multi-scale training
  • The YOLOv1 model was trained with an input resolution of 448x448 and used fully connected layers to predict bounding boxes and class labels. However, in YOLOv2, with the addition of anchor boxes, the resolution changed to 416x416; moreover, the network had no fully connected layers. It was a fully convolutional network with just convolutional and pooling layers. Hence, the input to the network could be resized on the fly while training the model.
  • The network input is varied every few iterations: after every ten batches, the network randomly chooses a new input resolution. Recall from the anchor-box discussion that the network downsamples the image by a factor of 32, so it chooses from multiples of 32: {320, 352, 384, 416, …, 608}. A minimal sketch of this loop is shown after this list.
  • This type of training allows the network to predict at different image resolutions. The network runs much faster on smaller inputs, offering a trade-off between speed and accuracy; larger inputs are slower but achieve the highest accuracy.
notion image
  • Multi-scale training also helps avoid overfitting because the model is forced to train on a variety of input resolutions.
  • At test time, we can resize the images to many different sizes without modifying the trained weights.
  • At low resolution 288x288, YOLOv2 runs at more than 90 FPS with an mAP of 69.0, close to Fast R-CNN. Of course, there’s no comparison in terms of FPS. You could use a low-resolution variant on GPU with fewer CUDA cores or older architectures and even deploy the optimized version on embedded devices like Jetson Nano, Xavier NX, Intel Neural Compute Stick.
  • High resolution (i.e., 544x544) outperforms all the other detection frameworks becoming the state-of-the-art detector with 78.6 mAP while still achieving more than real-time speed.
  • The multi-scale training approach produced a 1.5% boost in mAP.
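A minimal sketch of the resolution-switching training loop described above, assuming the 320-608 range in steps of 32 and a switch every ten batches; the model, dataloader, optimizer, and loss function are placeholders, not the paper's code.

import random
import torch.nn.functional as F

scales = list(range(320, 609, 32))  # {320, 352, ..., 608}, all multiples of 32

def train_multiscale(model, dataloader, optimizer, loss_fn):
    size = 416
    for step, (images, targets) in enumerate(dataloader):
        if step % 10 == 0:                      # pick a new input resolution every ten batches
            size = random.choice(scales)
        images = F.interpolate(images, size=(size, size), mode="bilinear", align_corners=False)
        optimizer.zero_grad()
        loss = loss_fn(model(images), targets)
        loss.backward()
        optimizer.step()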
Extensions for YOLO9000
  • Hierarchical classification
  • Combining ImageNet and MS COCO with a WordTree
  • Joint training for classification and detection
 
Results
notion image
notion image
notion image
 

YOLOv3

 
Paper link
 
Main ideas
Darknet53
  • Darknet-53 is an architecture consisting of 53 convolutional layers that acts as the base, or feature extractor, for the object detection network. These 53 layers are pretrained on the image classification task using the ImageNet dataset.
  • For the object detection task, 53 more layers are stacked on top of the base/backbone network, for a total of 106 layers, giving the final model known as YOLOv3.
notion image
  • Inspired by new classification networks like ResNet, DenseNet, etc., YOLOv3 was deeper than its predecessor and borrowed ideas like residual blocks (skip connection with addition) and skip connection with concatenation to avoid vanishing gradient problems and help propagate information that helped predict objects at different scales.
  • The important parts to note in YOLOv3 network architecture are residual blocks, skip connections, and upsampling layers.
  • Darknet-53 network architecture is more potent than Darknet-19 and more efficient than ResNet-101 and ResNet-152.
notion image
Pre-training stage
  • The Darknet-53 architecture was trained on the image classification task with the ImageNet dataset in the pretraining step.
  • The number of filters starts at 32 and doubles at each downsampling convolution that precedes a residual group. Each residual block has a bottleneck structure: a 1x1 convolution followed by a 3x3 convolution, with a residual skip connection. Finally, for image classification, the last layers are a fully connected layer and a softmax that outputs 1000-class probability scores.
Detection stage
  • In the detection step, the layers after the last residual group (i.e., the classification head) are removed, giving us the backbone for our detector. Since YOLOv3 is meant to detect objects at multiple scales, a detection layer is attached at each of the last three residual groups to make object detection predictions.
  • Assuming the input to the network is 416x416, the three feature maps we obtain are 52x52, 26x26, and 13x13, responsible for detecting small, medium, and large objects, respectively; the sketch below shows the corresponding output shapes.
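As a rough illustration, with three anchors per scale and the 80 MS COCO classes, each detection layer outputs 3 x (5 + 80) = 255 channels per grid cell, so the three raw prediction maps for a 416x416 input have the following shapes (a sketch of the shapes only, not the actual network):

import torch

num_anchors, num_classes = 3, 80              # 3 anchors per scale, COCO classes
channels = num_anchors * (5 + num_classes)    # (x, y, w, h, objectness) + class scores = 255

# Shapes of the three raw prediction maps for a 416x416 input
small_objects  = torch.zeros(1, channels, 52, 52)   # stride 8
medium_objects = torch.zeros(1, channels, 26, 26)   # stride 16
large_objects  = torch.zeros(1, channels, 13, 13)   # stride 32
print(small_objects.shape, medium_objects.shape, large_objects.shape)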
notion image
notion image
 
Results
notion image

YOLOv4

Paper link
Main ideas
notion image
notion image
notion image
 
 
Final architecture
notion image
 
Selected BoF (Bag of Freebies) and BoS (Bag of Specials)
notion image
notion image
 
CutMix
CutMix works similarly to the 'Cutout' image augmentation method, but rather than cropping a part of the image and replacing it with zero values, CutMix replaces it with a patch from a different image.
notion image
Cutouts force the model to make predictions based on a more robust set of features. Without cutouts, the model may rely specifically on a dog's head to make a prediction, which is problematic if we want to accurately recognize a dog whose head is hidden (perhaps behind a bush).
notion image
Mosaic Data Augmentation
  • Mosaic augmentation stitches four training images into one image in specific ratios (instead of only two in CutMix).
  • The network sees more context information within one image and even outside their normal context.
  • Allows the model to learn how to identify objects at a smaller scale than usual.
  • Batch normalization computes activation statistics over four different images at each layer, which significantly reduces the need for a large mini-batch size during training. A simplified sketch of mosaic stitching follows this list.
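A simplified sketch of mosaic stitching (image pixels only; in practice the bounding boxes of each source image must also be shifted and clipped, and the resize here uses naive nearest-neighbour indexing just to keep the sketch dependency-free):

import numpy as np

def mosaic(images, out_size=640):
    """Stitch four images into one mosaic around a random center point."""
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    cx = np.random.randint(out_size // 4, 3 * out_size // 4)
    cy = np.random.randint(out_size // 4, 3 * out_size // 4)
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        h, w = y2 - y1, x2 - x1
        ys = np.linspace(0, img.shape[0] - 1, h).astype(int)   # naive nearest-neighbour resize
        xs = np.linspace(0, img.shape[1] - 1, w).astype(int)
        canvas[y1:y2, x1:x2] = img[ys][:, xs]
    return canvas

four = [np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8) for _ in range(4)]
print(mosaic(four).shape)  # (640, 640, 3)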
notion image
Class Label Smoothing
  • Class label smoothing is a regularization technique used in a multi-class classification problem in which class labels are modified. Generally, for a problem statement involving three classes: cat, dog, elephant, the correct classification for a bounding box would represent a one-hot vector of classes [0,0,1], and the loss function is calculated based on this representation.
  • However, hard labels push the model toward positive infinity for the correct class and negative infinity for the zeros. This makes the model prone to overfitting the training data, as it learns to be overly confident, with predictions close to 1.0, while in reality it is often wrong and overlooks the uncertainty in other predictions.
  • Following this intuition, it is more reasonable to encode the class labels so that they reflect this uncertainty to some degree. The authors chose 0.9, so [0, 0, 0.9] represents the correct class, and the model's goal is no longer to be 100% confident in its class prediction. The same can be applied to the zero entries, which can be set to 0.05 or 0.1 (see the sketch after this list).
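A short sketch of the generic uniform label-smoothing formulation with smoothing factor eps; the exact target values used in YOLOv4 may differ slightly:

import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Turn hard 0/1 targets into eps/K and 1 - eps + eps/K."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k

print(smooth_labels(np.array([0.0, 0.0, 1.0])))  # [0.0333... 0.0333... 0.9333...]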
Self-Adversarial Training
It is well known that neural networks tend to perform poorly even when a small perturbation is added to the input image. For example, given a cat image with a small perturbation as input, the network could classify it as a traffic light even though both images look visually identical. Human vision is unaffected by the perturbation, but a neural network suffers from this kind of attack, so we need to force the network to learn that both images are the same.
Figure: Perturbation applied on the cat image (source: Shafahi et al., 2020).
notion image
The self-adversarial training is done in two forward-and-backward stages. In the first stage, the model performs a forward pass on a training sample. Normally, we would adjust the model weights during backpropagation to improve the model at detecting objects in this image; here, instead, the gradient is applied in the reverse direction, like gradient ascent, on the image itself. The image is perturbed in the way that degrades the detector's performance the most, creating an adversarial attack targeted at the current model, even though the new image may look visually similar. In the second stage, the model is trained on this perturbed image with the original bounding boxes and class labels. This helps create a robust model that generalizes well and reduces overfitting. A rough sketch follows.
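A rough sketch of the two stages, assuming a PyTorch detector and a simple FGSM-style perturbation in the first stage; the hyperparameter eps and the helper names are illustrative, and the actual YOLOv4 implementation may differ.

import torch

def self_adversarial_step(model, images, targets, loss_fn, optimizer, eps=0.01):
    # Stage 1: perturb the image to maximally degrade the detector (gradient ascent on the input)
    images = images.clone().requires_grad_(True)
    loss = loss_fn(model(images), targets)
    loss.backward()
    with torch.no_grad():
        adversarial = (images + eps * images.grad.sign()).clamp(0, 1)

    # Stage 2: train the model on the perturbed image with the original boxes and labels
    optimizer.zero_grad()
    loss = loss_fn(model(adversarial), targets)
    loss.backward()
    optimizer.step()
    return loss.item()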
Cross-Stage Partial Connection (CSP)
  • The YOLOv4 authors were inspired by the CSPNet paper that showed that adding cross-stage partial connections to ResNet, ResNext, and DenseNet reduced computation cost and memory usage of these networks and benefited the inference speed and accuracy.
  • CSPNet separates the input feature maps or the base layer of the DenseBlock into two parts. The first part bypasses the DenseBlock and goes directly as an input to the transition layer. The second part goes through the Dense block.
    • notion image
  • In a standard DenseBlock, each convolutional layer is applied to the input feature map and its output is concatenated with that input, and this repeats sequentially throughout the block. CSP, however, feeds only part of the input feature map into the dense block, while the remaining part goes directly to the transition layer. This design reduces the computational complexity by splitting the input into two parts, with only one going through the DenseBlock. A minimal sketch of the split follows.
notion image
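A minimal sketch of the cross-stage partial split around an arbitrary inner block (a generic illustration, not the exact CSPDarknet layer layout):

import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    def __init__(self, channels, inner_block):
        super().__init__()
        self.inner = inner_block                      # e.g. a stack of residual/dense layers
        self.transition = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        part1, part2 = x.chunk(2, dim=1)              # split the feature map channels in two
        part2 = self.inner(part2)                     # only one part goes through the heavy block
        return self.transition(torch.cat([part1, part2], dim=1))

block = CSPBlock(64, nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.LeakyReLU(0.1)))
print(block(torch.randn(1, 64, 52, 52)).shape)  # torch.Size([1, 64, 52, 52])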
 
Modified Spatial Pyramid Pooling
notion image
notion image
Path Aggregation Network
notion image
notion image
Spatial Attention Module
notion image
notion image
DropBlock Regularization
The DropBlock regularization technique is similar to the Dropout regularization used to prevent overfitting. However, in DropBlock regularization, the dropped feature points are no longer spread randomly but are grouped into blocks, and the entire block is dropped. Dropping activations at random is ineffective at removing semantic information because nearby activations contain closely related information. Dropping contiguous regions instead can remove certain semantic information (e.g., a head or feet) and consequently force the remaining units to learn features useful for classifying the input image. A simplified sketch follows.
notion image
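A simplified DropBlock sketch in PyTorch; the gamma formula follows the original DropBlock paper, while YOLOv4's exact block size and drop probability may differ.

import torch
import torch.nn.functional as F

def drop_block(x, block_size=7, drop_prob=0.1):
    """Zero out contiguous block_size x block_size regions of the feature map."""
    if drop_prob == 0.0:
        return x
    n, c, h, w = x.shape
    # gamma controls how many block centers are sampled so that roughly drop_prob of units are dropped
    gamma = drop_prob * (h * w) / (block_size ** 2) / ((h - block_size + 1) * (w - block_size + 1))
    centers = (torch.rand(n, c, h, w, device=x.device) < gamma).float()
    # expand each sampled center into a block with a max-pool, then invert to get the keep-mask
    mask = 1.0 - F.max_pool2d(centers, kernel_size=block_size, stride=1, padding=block_size // 2)
    return x * mask * mask.numel() / mask.sum().clamp(min=1.0)  # rescale to preserve the activation scale

print(drop_block(torch.randn(1, 64, 13, 13)).shape)  # torch.Size([1, 64, 13, 13])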
 
Results
notion image
notion image
notion image
notion image
 
notion image
notion image

Scaled YOLOv4

Paper link
 
Main ideas
Convolutional neural network architecture can be scaled in three dimensions: depth, width, and resolution. The depth of the network corresponds to the number of layers in a network. The width is associated with the number of filters or channels in a convolutional layer. Finally, the resolution is simply the height and width of the input image.
notion image
And that is what Scaled-YOLOv4 does: it uses optimal network scaling techniques to produce the YOLOv4-CSP -> P5 -> P6 -> P7 family of detection networks.
Improvements
  • Scaled-YOLOv4 uses optimal network scaling techniques to achieve YOLOv4-CSP -> P5 -> P6 -> P7 networks.
    • notion image
  • Modified activations for width and height, which allows faster network training.
  • Improved network architecture: Backbone optimized and Neck (Path-Aggregation Network) uses CSP connections and Mish activation.
  • Exponential Moving Average (EMA) is used during the training.
  • For each network resolution, a separate network is trained, whereas in YOLOv4 a single network was trained on multiple resolutions.
 
Results
notion image
 

PP-YOLOv1

Paper link
 
Main ideas
So far, we have seen YOLO in two different frameworks, namely Darknet and PyTorch; however, there is a third framework in which YOLO was implemented, PaddlePaddle, hence the name PP-YOLO. PaddlePaddle is a deep learning framework written by Baidu, which has a massive repository of Computer Vision and Natural Language Processing models.
PP-YOLO is part of PaddleDetection, an end-to-end object detection development kit based on the PaddlePaddle framework. It provides a ton of object detection architectures, backbones, data augmentation techniques, components (like losses, feature pyramid network, etc.) that can be combined in different configurations to design the best object detection network.
In short, it provides image processing capabilities such as object detection, instance segmentation, multi-object tracking, and keypoint detection, and it eases the construction, training, optimization, and deployment of these models in a faster and better way.
notion image
 
notion image
The PP-YOLO detector is divided into three parts:
  • Backbone: The backbone in an object detector is a fully convolutional network that helps extract feature maps from the image. It is similar in spirit to a pre-trained image classification model. Instead of using the Darknet-53 architecture (in YOLOv3 and YOLOv4), the proposed model used a ResNet50-vd-dcn as the backbone. In the proposed backbone model, the 3×3 convolution layer is replaced by deformable convolutions in the last stage of the architecture. The number of parameters and FLOPs of ResNet50-vd are much smaller than those of Darknet-53. This helped in achieving a slightly higher mAP of 39.1 compared to YOLOv3.
  • Detection Neck: The Feature Pyramid Network (FPN) creates a pyramid of features by lateral connections between the feature maps. If you look closely at the below figure, feature maps from stages C3, C4, and C5 are fed as an input to the FPN module.
  • Detection Head: The detection head is the final part of the object detection pipeline that predicts the bounding box (localization) and classification of the objects. The head of PP-YOLO is the same as the YOLOv3 head: the final output is predicted using a 3×3 convolution layer followed by a 1×1 convolution layer.
In the PP-YOLO architecture diagram above, the diamond injection points denote the CoordConv layers, the purple triangles represent DropBlocks, and the red star marks indicate Spatial Pyramid Pooling.
New features
  • Larger Batch Size: Leveraging a larger batch size helps stabilize the training and lets the model produce better results. The batch size is changed from 64 to 192, and accordingly, the learning rate and training schedule are also updated.
  • Exponential Moving Average: The authors claim that using moving averages of the trained parameters produced better results during inference.
  • DropBlock Regularization: It is a technique similar to the Dropout regularization used to prevent overfitting. However, in Dropout block regularization, the dropped feature points are no longer spread randomly but are combined into blocks, and the entire block is dropped.
  • Intersection over Union (IoU) Loss: An extra IoU loss is added to train the model; the existing L1 loss used in YOLOv3 and most YOLO architectures is not replaced. An extra branch is added to calculate the IoU loss. This is done because the mAP evaluation metric strongly relies on IoU.
  • IoU Aware: Since localization accuracy is not considered in the final detection confidence, an IoU prediction branch is added to measure it. During inference, the predicted IoU score is multiplied by the classification probability and objectness score to produce the final detection confidence (see the sketch after this list).
  • Matrix Non-Maximum Suppression (NMS): A parallel implementation of soft NMS is used; it is faster than traditional NMS and does not bring any loss of efficiency. Ordinary soft NMS works sequentially and cannot be implemented in parallel.
  • Spatial Pyramid Pooling (SPP) Layer: The SPP layer implemented in YOLOv4 is applied in PP-YOLO as well but only in the top feature map. Adding SPP adds 2% of the model parameters and 1% FLOPS, but this lets the model increase the receptive field of the feature.
  • Better Pretrained Model: A pre-trained model with better classification accuracy on ImageNet is used, resulting in better detection performance. A distilled ResNet50-vd model is used as the pretrain model.
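A minimal sketch of the IoU-aware rescoring at inference time; the exponent alpha is a hyperparameter here, and the exact PP-YOLO formulation may differ.

import torch

def detection_confidence(objectness, class_probs, iou_pred, alpha=0.5):
    """Final confidence = objectness * class probability * predicted-IoU^alpha."""
    return objectness.unsqueeze(-1) * class_probs * iou_pred.unsqueeze(-1) ** alpha

obj = torch.sigmoid(torch.randn(100))               # objectness per predicted box
cls = torch.softmax(torch.randn(100, 80), dim=-1)   # class probabilities (80 COCO classes)
iou = torch.sigmoid(torch.randn(100))               # predicted IoU with the ground truth
print(detection_confidence(obj, cls, iou).shape)    # torch.Size([100, 80])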
 
Results
notion image
notion image
notion image

YOLOv5

Paper link
No paper yet…
 
Main ideas
Today, YOLOv5 is one of the official state-of-the-art models with tremendous support and is easier to use in production. The best part is that YOLOv5 is natively implemented in PyTorch, eliminating the Darknet framework's limitations (based on the C programming language and not built with a production-environment perspective). The Darknet framework has evolved over time and is a great research framework: training, fine-tuning, and inference with TensorRT are all possible with Darknet. However, it has a smaller community and hence less support.
This move of YOLO to PyTorch made it easier for developers to modify the architecture and export it to many deployment environments. And not to forget, YOLOv5 is one of the official state-of-the-art models hosted in the Torch Hub showcase.
import torch
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')  # or yolov5m, yolov5l
img = 'https://ultralytics.com/images/zidane.jpg'  # or file, Path, PIL, OpenCV
results = model(img)
results.print()  # or .show(), .save()
 
 
Moreover, they have developed an iOS application called iDetection, which offers four variants of YOLOv5. We tested the application on iPhone 13 Pro, and the results were impressive; the model runs detection at close to 30FPS.
Like YOLOv4, YOLOv5 uses Cross-Stage Partial connections with Darknet-53 in the backbone and a Path Aggregation Network as the neck. The major improvements include mosaic data augmentation (from the YOLOv3 PyTorch implementation) and auto-learning bounding box anchors.
 
Results
notion image
 
notion image

PP-YOLOv2

Paper link
 
Main ideas
  • Path Aggregation Network (PAN): To detect objects at different scales, the authors employ PAN in the neck of the object detection network. In PP-YOLO, Feature Pyramid Network was leveraged to compose bottom-up paths. Similar to YOLOv4, in PP-YOLOv2, the authors follow the design of PAN to aggregate the top-down information.
  • Mish Activation Function: The mish activation function is adopted in the neck of the detection network rather than the backbone, since PP-YOLOv2 keeps the backbone's pre-trained parameters, which already achieve a robust 82.4% top-1 accuracy on the ImageNet classification dataset. Mish has proved effective in the backbones of various practical object detectors like YOLOv4 and YOLOv5.
  • Larger Input Size: Detecting smaller objects is often a challenge, and as the image traverses the network, the information about small-scale objects is lost. Thus, in PP-YOLOv2, the input size is increased, enlarging the effective area of objects and improving performance. The largest input size is increased from 608 to 768. Since a larger input resolution occupies more memory, the batch size is reduced from 24 to 12 images per GPU, with input sizes drawn uniformly from [320, 352, 384, 416, 448, 480, 512, 544, 576, 608, 640, 672, 704, 736, 768].
  • IoU Aware Branch: In PP-YOLO, the IoU-aware loss was calculated in a soft-weight format, which is inconsistent with the original intention. Thus, in PP-YOLOv2, a soft-label format is used instead, which better tunes PP-YOLO's loss function and makes it more aware of the overlap between bounding boxes.
 
Results
notion image

YOLOX

Paper link
Main ideas
So far, the only anchor-free YOLO object detector we have covered is YOLOv1, but YOLOX also detects objects in an anchor-free manner. Moreover, it incorporates other advanced detection techniques such as a decoupled head, strong data augmentation, and the leading label assignment strategy SimOTA to achieve state-of-the-art results.
YOLOX won the 1st Place on Streaming Perception Challenge (Workshop on Autonomous Driving conducted in conjunction with CVPR 2021) using a single YOLOX-L model.
YOLOX-L achieved 50.0% AP on COCO at a speed of 68.9 FPS on a Tesla V100, with roughly the same number of parameters as YOLOv4-CSP and YOLOv5-L, exceeding YOLOv5-L by 1.8% AP.
YOLOv3 with Darknet-53 backbone is selected as the baseline. Then, a series of improvements were made to the base model.
Decoupled Head
notion image
notion image
Data Augmentation
Mosaic and MixUp data augmentation techniques similar to YOLOv4 were added to boost YOLOX performance. Mosaic is an efficient augmentation strategy proposed by ultralytics-YOLOv3.
Using the above two augmentation techniques, the authors found that pre-training the backbone on the ImageNet dataset was no longer beneficial, so they trained the model from scratch.
 
Anchor Free Detection
To develop a high-speed object detector, YOLOX adopts an anchor-free mechanism, which reduces the number of design parameters since we no longer have to deal with anchor boxes, which had significantly increased the number of predictions. For each location or grid cell in the prediction head, we now have only one prediction instead of outputs for three different anchor boxes. Each object's center location is considered a positive sample, and there is a predefined scale range.
Simply put, in anchor-free detection, the predictions per grid cell are reduced from three to one, and the head directly predicts four values: two offsets with respect to the top-left corner of the grid cell, and the width and height of the predicted box (see the sketch below).
With this approach, the network parameters and GFLOPs of the detector are reduced, making the detector faster, and the performance even improves to 42.9% AP.
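A minimal sketch of decoding such anchor-free predictions back to boxes in image coordinates, assuming per-cell raw outputs (dx, dy, log w, log h) as described above; the variable names are illustrative only.

import torch

def decode_anchor_free(preds, stride):
    """preds: (N, H, W, 4) raw outputs per grid cell -> (N, H, W, 4) boxes as (cx, cy, w, h) in pixels."""
    n, h, w, _ = preds.shape
    gy, gx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    cx = (gx + preds[..., 0]) * stride          # offset from the top-left corner of the cell
    cy = (gy + preds[..., 1]) * stride
    bw = preds[..., 2].exp() * stride           # width/height predicted in log-space
    bh = preds[..., 3].exp() * stride
    return torch.stack([cx, cy, bw, bh], dim=-1)

boxes = decode_anchor_free(torch.randn(1, 80, 80, 4), stride=8)
print(boxes.shape)  # torch.Size([1, 80, 80, 4])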
To have a fair comparison, YOLOX replaces the Darknet-53 backbone with YOLOv5’s modified CSP v5 backbone along with SiLU activation and the PAN head. By leveraging its scaling rule YOLOX-S, YOLOX-M, YOLOX-L, and YOLOX-X models are produced.
 
Results
notion image
notion image
notion image

YOLOR

Paper link
 
Main ideas
TBD
 
Results
notion image