Here's the link to the paper regarding MobileNet V3.
MobileNet V3
According to the paper, h-swish and the Squeeze-and-Excitation (SE) module are used in MobileNet V3, but both aim to improve accuracy and don't help with speed.
h-swish is faster than swish and helps enhance the accuracy, but is much slower than ReLU if I'm not mistaken.
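For reference, here is how the three activations compare when written out. This is a minimal sketch in plain Python, assuming the usual definitions swish(x) = x * sigmoid(x) and h-swish(x) = x * ReLU6(x + 3) / 6; the function names are my own and are not taken from any library.

```python
import math


def relu(x: float) -> float:
    # A single comparison per element.
    return max(0.0, x)


def relu6(x: float) -> float:
    # ReLU clamped to the range [0, 6].
    return min(max(0.0, x), 6.0)


def swish(x: float) -> float:
    # x * sigmoid(x); the sigmoid needs an exponential.
    return x / (1.0 + math.exp(-x))


def h_swish(x: float) -> float:
    # Piecewise-linear stand-in for swish: no exponential,
    # only an add, a clamp, a multiply and a constant scale.
    return x * relu6(x + 3.0) / 6.0
```

Written this way, h-swish avoids the exponential that swish needs, but it still does more work per element than ReLU, which matches the ordering described above.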
SE also helps enhance the accuracy, but it increases the number of parameters of the network.
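To get a feel for how many parameters an SE block adds, here is a rough count. It is only a sketch under assumptions of mine: the block is two fully connected layers (reduce, then expand) applied to a globally pooled vector, the reduction ratio defaults to 4, and both layers have biases. `se_param_count` is a hypothetical helper, not a library function.

```python
def se_param_count(channels: int, reduction: int = 4, bias: bool = True) -> int:
    """Parameters added by one SE block on a feature map with `channels` channels.

    Assumes two fully connected layers: channels -> channels // reduction -> channels.
    """
    squeezed = channels // reduction
    # First layer: reduce the pooled vector to the squeezed width.
    reduce_fc = channels * squeezed + (squeezed if bias else 0)
    # Second layer: expand back to one scale factor per channel.
    expand_fc = squeezed * channels + (channels if bias else 0)
    return reduce_fc + expand_fc


if __name__ == "__main__":
    for c in (64, 96, 240):
        print(c, se_param_count(c))
```

Under these assumptions the extra layers act on one value per channel rather than on the full feature map, so the parameter count grows while the added multiply-adds should stay small compared with the convolutions around them.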
Am I missing something? I still have no idea how MobileNet V3 can be faster than V2 with what's said above implemented in V3.
I didn't mention that they also modify the last part of the network, because I plan to use MobileNet V3 as the backbone and combine it with SSD layers for detection, so the last part of the network won't be used.
The following table, which can be found in the paper mentioned above, shows that V3 is still faster than V2.
Object detection results for comparison
MobileNet consists of 28 layers, including depthwise convolution layers, 1 × 1 pointwise convolution layers, batch normalization, ReLU, an average pooling layer and softmax. Figure 3 shows the MobileNet architecture.
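The saving from splitting a standard convolution into a per-channel (depthwise) step and a 1 × 1 pointwise step can be checked with a quick multiply-add count. This is a sketch only: it assumes stride 1 and padding that keeps the output the same size as the input, and the function names and the example layer shape are made up for illustration.

```python
def standard_conv_madds(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    # Every output value sums over a k x k window across all input channels.
    return k * k * c_in * c_out * h * w


def separable_conv_madds(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    # Depthwise step: one k x k filter per input channel.
    depthwise = k * k * c_in * h * w
    # Pointwise step: a 1 x 1 convolution that mixes the channels.
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise


if __name__ == "__main__":
    shape = dict(k=3, c_in=32, c_out=64, h=112, w=112)  # illustrative layer shape
    ratio = separable_conv_madds(**shape) / standard_conv_madds(**shape)
    print(f"separable / standard = {ratio:.4f}")
```

The ratio works out to 1/c_out + 1/k², so with 3 × 3 kernels the separable form needs roughly 8 to 9 times fewer multiply-adds than the standard convolution.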
How Is It Different From MobileNetV1? The MobileNetV2 models are much faster than MobileNetV1. MobileNetV2 uses 2 times fewer operations, has higher accuracy, needs 30 percent fewer parameters and is about 30-40 percent faster on a Google Pixel phone.
SSDLite is an object detection model that aims to produce bounding boxes around objects in an image. SSDLite uses MobileNet for feature extraction to enable real-time object detection on mobile devices.
MobileNetV3 is a convolutional neural network that is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search (NAS) complemented by the NetAdapt algorithm, and then subsequently improved through novel architecture advances.
MobileNetV3 is faster and more accurate than MobileNetV2 on the classification task, but this is not necessarily true on a different task, such as object detection. As you mention yourself, the optimizations they made at the deepest end of the network are mostly relevant to the classification variant, and as can be seen in the table you referenced, the mAP is no better.
A few things to consider, though: