I've read the MobileNetV2 paper (arXiv:1801.04381)
and ran the model from the TensorFlow model zoo.
I noticed that inference with SSDLite MobileNetV2 is faster than with SSD MobileNetV2.
In the MobileNetV2 paper, the only explanation of SSDLite is this short sentence:
'We replace all the regular convolutions with separable convolutions (depthwise followed by 1 × 1 projection) in SSD prediction layers'.
So my question is: what is the difference between SSD and SSDLite?
I don't understand the difference, because when MobileNetV1 (arXiv:1704.04861) was published and applied to SSD, it had already replaced all the convolutional layers with the depthwise separable convolutions mentioned above.
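As far as I understand, the replacement described in that sentence looks roughly like the following Keras sketch (the channel sizes are made up for illustration, not taken from the paper):

import tensorflow as tf

# Hypothetical sizes: 512 input channels, 24 outputs
# (e.g. 6 anchors x 4 box coordinates in an SSD box-prediction head).
inputs = tf.keras.Input(shape=(19, 19, 512))

# Regular SSD prediction layer: a single 3x3 convolution.
regular = tf.keras.layers.Conv2D(24, kernel_size=3, padding="same")(inputs)

# SSDLite-style replacement: a 3x3 depthwise convolution
# followed by a 1x1 pointwise projection.
depthwise = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding="same")(inputs)
separable = tf.keras.layers.Conv2D(24, kernel_size=1)(depthwise)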
SSDLite is an adaptation of SSD that was first briefly introduced in the MobileNetV2 paper and later reused in the MobileNetV3 paper. Because the main focus of both papers was to introduce novel CNN architectures, most of the implementation details of SSDLite were not clarified.
It is frustrating, since all searches for SSDLite lead back to "a novel framework we call SSDLite", so I was expecting something more substantial. However, I suspect that SSDLite is simply implemented by one modification (kernel_size) and two additions (use_depthwise) to the common SSD model file.
Comparing the model files ssd_mobilenet_v1_coco.config and ssdlite_mobilenet_v2_coco.config produces the following:
model {
  ssd {
    box_predictor {
      convolutional_box_predictor {
        kernel_size: 3
        use_depthwise: true
      }
    }
    feature_extractor {
      use_depthwise: true
    }
  }
}
I'll have to try it out.
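If you want to reproduce the comparison, a plain text diff of the two config files is enough; here is a minimal sketch using Python's difflib (the paths are assumed to be local copies of the sample configs from the TensorFlow Object Detection API):

import difflib

# Assumed local copies of the pipeline configs from the
# TensorFlow Object Detection API sample configs.
with open("ssd_mobilenet_v1_coco.config") as f:
    ssd = f.readlines()
with open("ssdlite_mobilenet_v2_coco.config") as f:
    ssdlite = f.readlines()

# Print a unified diff; the SSDLite-specific options show up as additions.
for line in difflib.unified_diff(ssd, ssdlite, fromfile="ssd", tofile="ssdlite"):
    print(line, end="")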
As one of the answers already pointed out, the main differences in the configs are the two use_depthwise options, for both the box_predictor and the feature_extractor. The underlying changes had already been implemented in the codebase; they essentially replace all regular convolutions in the SSD layers, and in the final box and class prediction layer, with depthwise + pointwise separable convolutions. The theoretical parameter and FLOPs savings are described in our MobileNetV2 paper.
Also, to answer the question from @Seongkyun Han: we did not replace all the convolutions in the SSD layers in our v1 paper (only the layers belonging to MobileNet itself used separable convolutions).
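To put rough numbers on that saving, take a hypothetical 3x3 prediction layer with 512 input channels and 24 outputs (the real channel counts vary per layer; these values are only illustrative):

# Illustrative parameter counts, ignoring biases; sizes are assumptions.
k, c_in, c_out = 3, 512, 24

regular = k * k * c_in * c_out           # 3*3*512*24 = 110,592 weights
separable = k * k * c_in + c_in * c_out  # 4,608 depthwise + 12,288 pointwise = 16,896

print(regular / separable)               # roughly 6.5x fewer parameters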