EfficientDet: Towards Scalable And Efficient Object Detection

Video: EfficientDet: Towards Scalable And Efficient Object Detection

Video: [DeepReader] EfficientDet: Scalable and Efficient Object Detection 2024, December
Anonymous

As one of the core applications of computer vision, object detection is becoming increasingly important in scenarios that demand high accuracy but have limited computing resources, such as robotics and driverless cars. Unfortunately, many current high-accuracy detectors do not fit within these constraints. More importantly, real-world object detection applications run on a variety of platforms, which often have different resource budgets.

Scalable and efficient object detection

So the natural question is: how can we design accurate and efficient object detectors that also adapt to a wide range of resource constraints?

EfficientDet: Scalable and Efficient Object Detection, accepted at CVPR 2020, introduces a new family of scalable and efficient object detectors. Building on previous work on scaling neural networks (EfficientNet) and incorporating a new bi-directional feature network (BiFPN) and new scaling rules, EfficientDet achieves state-of-the-art accuracy while being up to 9 times smaller and using significantly less computation than prior state-of-the-art detectors. The following figure shows the overall network architecture of the models.

Image

Optimizing Model Architecture

The idea behind EfficientDet stems from our effort to improve computational efficiency by systematically examining previous state-of-the-art detection models. In general, object detectors have three main components: a backbone that extracts features from a given image; a feature network that takes multiple levels of features from the backbone as input and outputs a list of fused features that represent salient characteristics of the image; and a final class/box network that uses the fused features to predict the class and location of each object.

After reviewing the design options for these components, we identified several key optimizations to improve performance and efficiency. Previous detectors mostly use ResNets, ResNeXt or AmoebaNet as backbones, which are either less powerful or less efficient than EfficientNets. Simply adopting an EfficientNet backbone already yields much better efficiency. For example, starting with a RetinaNet baseline that uses a ResNet-50 backbone, our ablation study shows that simply replacing ResNet-50 with EfficientNet-B3 can improve accuracy by 3% while reducing computation by 20%. Another optimization is to improve the efficiency of the feature network. While most previous detectors simply use the top-down feature pyramid network (FPN), we find that the top-down FPN is inherently limited by its one-way flow of information. Alternative FPNs such as PANet add an extra bottom-up flow at the cost of additional computation.

Recent attempts to use neural architecture search (NAS) have discovered the more complex NAS-FPN architecture. However, while this network structure is effective, it is also irregular and highly optimized for a specific task, making it difficult to adapt to other tasks. To solve these problems, we propose a new bi-directional feature network, BiFPN, which incorporates the multi-level feature fusion idea from FPN / PANet / NAS-FPN and allows information to flow both top-down and bottom-up, using regular and efficient connections.
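The two-way information flow can be illustrated with a toy sketch. It is not the BiFPN implementation: each pyramid level is reduced to a single scalar, plain averaging stands in for the learned fusion, and resizing and convolutions are omitted entirely. The function names are hypothetical, and the choice to skip the intermediate node at the coarsest level is an assumption about the layout.

```python
from typing import List


def average(values: List[float]) -> float:
    """Stand-in for a fusion node: here just an unweighted mean."""
    return sum(values) / len(values)


def bidirectional_fuse(inputs: List[float]) -> List[float]:
    """Toy two-pass fusion over pyramid levels.

    inputs[0] is the finest (highest-resolution) level and
    inputs[-1] the coarsest. Scalars replace feature maps.
    """
    n = len(inputs)
    if n < 2:
        raise ValueError("need at least two pyramid levels")

    # Top-down pass: information flows from coarse levels to fine levels.
    top_down = list(inputs)
    for i in range(n - 2, -1, -1):
        top_down[i] = average([inputs[i], top_down[i + 1]])

    # Bottom-up pass: information flows from fine levels back to coarse ones.
    outputs = list(top_down)
    for i in range(1, n):
        sources = [inputs[i], outputs[i - 1]]
        if i < n - 1:
            # Only middle levels have an intermediate top-down node (assumption).
            sources.append(top_down[i])
        outputs[i] = average(sources)
    return outputs


if __name__ == "__main__":
    # Signal only at the coarsest level still reaches the finest output.
    print(bidirectional_fuse([0.0, 0.0, 9.0]))
    # Signal only at the finest level still reaches the coarsest output.
    print(bidirectional_fuse([9.0, 0.0, 0.0]))
```

With a purely top-down network, a signal present only at the finest level would never reach the coarser outputs; the second pass is what makes that possible.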

Image

To further improve efficiency, we propose a new fast normalized fusion technique. Traditional approaches usually treat all inputs to the FPN equally, even when they have different resolutions. However, we observe that input features at different resolutions often contribute unequally to the output features. Thus, we add an extra weight to each input feature and let the network learn the importance of each. We also replace all regular convolutions with less expensive depthwise separable convolutions. With these optimizations, our BiFPN further improves accuracy by 4% while reducing computational cost by 50%.
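The weighting scheme described above can be sketched in a few lines. This is a minimal illustration, assuming the form output = sum(w_i * input_i) / (eps + sum(w_j)), with each weight passed through a ReLU to keep it non-negative. The function name, the use of plain Python lists in place of feature maps, and the value of `eps` are assumptions made for this example.

```python
from typing import List, Sequence

EPSILON = 1e-4  # small stabilizing constant; the exact value is an assumption


def fast_normalized_fusion(
    features: Sequence[Sequence[float]],
    raw_weights: Sequence[float],
    eps: float = EPSILON,
) -> List[float]:
    """Fuse same-length feature vectors using one learned weight per input."""
    if len(features) != len(raw_weights):
        raise ValueError("need exactly one weight per input feature")

    # ReLU keeps every weight non-negative, so the normalized
    # weights fall between 0 and 1 without needing a softmax.
    weights = [max(0.0, w) for w in raw_weights]
    total = sum(weights) + eps

    length = len(features[0])
    fused = [0.0] * length
    for feature, weight in zip(features, weights):
        if len(feature) != length:
            raise ValueError("all input features must have the same length")
        for i, value in enumerate(feature):
            fused[i] += (weight / total) * value
    return fused


if __name__ == "__main__":
    high_res = [1.0, 2.0, 3.0]
    low_res = [3.0, 2.0, 1.0]
    # A larger weight on high_res pulls the result toward it.
    print(fast_normalized_fusion([high_res, low_res], [3.0, 1.0]))
```

In a real network the raw weights would be trainable parameters updated by backpropagation; here they are passed in by hand to keep the example self-contained.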

The third optimization involves achieving a better trade-off between accuracy and efficiency under various resource constraints. Our previous work has shown that jointly scaling the depth, width, and resolution of a network can significantly improve image recognition performance. Inspired by this idea, we propose a new compound scaling method for object detectors that jointly scales up the resolution / depth / width. Each network component, i.e. the backbone, the feature network, and the box / class prediction network, has a single compound scaling factor that controls all scaling dimensions using heuristic rules. This approach makes it easy to determine how to scale the model by computing a scaling factor for a given target resource constraint.

By combining the new backbone and BiFPN, we first design a small EfficientDet-D0 baseline and then apply compound scaling to obtain EfficientDet-D1 to D7. Each successive model has a higher computational cost, covering a wide range of resource constraints from 3 billion FLOPs to 300 billion FLOPs, and provides higher accuracy.
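Heuristic rules of this kind can be sketched as a function of a single scaling factor. The constants below follow the rules recalled from the EfficientDet paper (BiFPN width 64 * 1.35^phi, BiFPN depth 3 + phi, head depth 3 + floor(phi / 3), input resolution 512 + 128 * phi) and should be treated as assumptions. Released configurations round channel widths to hardware-friendly values, and the largest model may deviate from these rules. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaledConfig:
    phi: int               # compound scaling factor
    bifpn_width: float     # BiFPN channels, before any rounding
    bifpn_depth: int       # number of BiFPN layers
    head_depth: int        # layers in the box / class prediction networks
    input_resolution: int  # input image side length in pixels


def compound_scale(phi: int) -> ScaledConfig:
    """Derive all scaling dimensions from one factor (assumed heuristic rules)."""
    if phi < 0:
        raise ValueError("phi must be non-negative")
    return ScaledConfig(
        phi=phi,
        bifpn_width=64 * (1.35 ** phi),      # width grows exponentially
        bifpn_depth=3 + phi,                 # depth grows linearly
        head_depth=3 + phi // 3,             # heads deepen every third step
        input_resolution=512 + 128 * phi,    # resolution grows linearly
    )


if __name__ == "__main__":
    for phi in range(7):
        print(compound_scale(phi))
```

The point of the design is that one integer selects a whole configuration, so moving to a tighter or looser resource budget means changing only `phi`.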

Model Performance

We evaluate EfficientDet on the COCO dataset, a widely used benchmark dataset for object detection. EfficientDet-D7 achieves a mean average precision (mAP) of 52.2, which is 1.5 points higher than the previous state-of-the-art model, while using 4 times fewer parameters and 9.4 times less computation.

Image

We also compared parameter size and CPU / GPU latency between EfficientDet and previous models. Under similar accuracy constraints, EfficientDet models run 2–4 times faster on GPU and 5–11 times faster on CPU than other detectors. While EfficientDet models are primarily designed for object detection, we also test their effectiveness on other tasks such as semantic segmentation. To perform segmentation, we slightly modify EfficientDet-D4 by replacing the detection head and loss function with a segmentation head and loss, while keeping the same scaled backbone and BiFPN. We compare this model with previous state-of-the-art segmentation models on Pascal VOC 2012, a widely used segmentation benchmark dataset.

Image

Given their exceptional performance, EfficientDet is expected to serve as a new foundation for future object detection research and potentially make highly accurate object detection models practical in many real-world applications. We have therefore open-sourced all the code and pretrained model checkpoints on GitHub.
