How is MobileNet V3 faster than V2?

Tags:

mobilenet

Here's the link to the paper regarding MobileNet V3.

MobileNet V3

According to the paper, MobileNet V3 adds h-swish and Squeeze-and-Excitation (SE) modules, but both are meant to improve accuracy and do nothing to improve speed.

h-swish is faster than swish and helps improve accuracy, but, if I'm not mistaken, it is much slower than ReLU.
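For reference, here is a minimal scalar sketch of the three activations being compared. The h-swish form, x * ReLU6(x + 3) / 6, is the definition given in the MobileNetV3 paper. The function names are my own, and this says nothing about actual on-device latency, which depends on the kernel implementation.

```python
import math


def relu(x):
    """ReLU: a single max operation."""
    return max(0.0, x)


def relu6(x):
    """ReLU clipped at 6."""
    return min(max(0.0, x), 6.0)


def swish(x):
    """swish: x * sigmoid(x); needs an exponential."""
    return x / (1.0 + math.exp(-x))


def h_swish(x):
    """h-swish: x * ReLU6(x + 3) / 6; piecewise, no exponential."""
    return x * relu6(x + 3.0) / 6.0


if __name__ == "__main__":
    for value in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(value, relu(value), round(swish(value), 4), round(h_swish(value), 4))
```

h-swish replaces the sigmoid with a clipped linear ramp, so it avoids the exponential, but it still costs an add, a clip, and a multiply per element, whereas ReLU is a single max.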

SE also helps enhance the accuracy, but it increases the number of parameters of the network.
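To put rough numbers on the SE cost, here is a back-of-the-envelope sketch. It assumes an SE block made of two fully connected layers with biases and a reduction ratio of 4 (the paper fixes the SE bottleneck at 1/4 of the expansion channels). The channel count and feature-map size in the demo are illustrative values, not taken from a specific layer, and the global pooling and channel rescaling are ignored.

```python
def se_params(channels, reduction=4):
    """Weights and biases of the two FC layers in an SE block."""
    squeezed = channels // reduction
    fc1 = channels * squeezed + squeezed   # channels -> squeezed
    fc2 = squeezed * channels + channels   # squeezed -> channels
    return fc1 + fc2


def se_macs(channels, reduction=4):
    """Multiply-accumulates of the two FC layers (pooling and rescale ignored)."""
    squeezed = channels // reduction
    return 2 * channels * squeezed


def pointwise_conv_macs(height, width, c_in, c_out):
    """Multiply-accumulates of a 1x1 convolution over a height x width map."""
    return height * width * c_in * c_out


if __name__ == "__main__":
    channels = 240  # illustrative expansion width
    print("SE parameters:", se_params(channels))
    print("SE MACs:", se_macs(channels))
    print("1x1 conv MACs at 14x14, 240 -> 80:", pointwise_conv_macs(14, 14, channels, 80))
```

Because the FC layers run on a pooled 1x1 tensor, their compute does not grow with the spatial size, so SE adds parameters more noticeably than it adds multiply-accumulates.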

Am I missing something? I still don't see how MobileNet V3 can be faster than V2 given the additions above.

I left out the changes they made to the last part of the network because I plan to use MobileNet V3 as a backbone and combine it with SSD layers for detection, so the last part of the network won't be used.

The following table from the paper shows that V3 is still faster than V2.

Object detection results for comparison


asked Jul 09 '19 09:07

Jefferson Chiu

1 Answer

MobileNetV3 is faster and more accurate than MobileNetV2 on the classification task, but this is not necessarily true for other tasks, such as object detection. As you mention yourself, the optimizations made at the deepest end of the network are mostly relevant to the classification variant, and as the table you referenced shows, the mAP is no better.

A few things to consider, though:

1. It's true that SE and h-swish both slow the network down a bit. SE adds some FLOPs and parameters, h-swish adds complexity, and both add some latency. However, both are introduced so that the accuracy-latency trade-off improves: either the added latency is worth the accuracy gain, or you can keep the same accuracy while trimming other parts of the network, reducing overall latency. Regarding h-swish specifically, note that they mostly use it in the deeper layers, where the tensors are smaller. Those tensors have more channels, but because the spatial resolution (height x width) drops quadratically, they are smaller overall, so h-swish adds less latency there.
2. The architecture itself (without h-swish, and even without considering SE) was found by search. This means it is better suited to the task than "vanilla" MobileNetV2, since it is less hand-engineered and is actually optimized for the task. You can see, for example, that as in MNASNet some of the kernels grew to 5x5 (rather than 3x3), not all expansion rates are x6, and so on.
3. One change they made to the deepest end of the network is also relevant to object detection. Oddly, in SSDLite-MobileNetV2 the original authors chose to keep the last 1x1 convolution, which expands the depth from 320 to 1280. While that many features makes sense for 1000-class classification, it is probably redundant for 80-class detection, as the authors of MNv3 say themselves in the middle of page 7 (bottom of the first column to the top of the second).
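The points about tensor size and the last 1x1 convolution can be checked with some quick arithmetic. This is only a sketch: the feature-map shapes below are illustrative values for a 224x224 input, not exact MobileNetV3 layer shapes, and the cost of the 320-to-1280 convolution assumes a 7x7 feature map and ignores batch norm.

```python
def num_elements(height, width, channels):
    """Number of values an element-wise activation has to process."""
    return height * width * channels


def pointwise_conv_params(c_in, c_out):
    """Weights of a 1x1 convolution (no bias)."""
    return c_in * c_out


def pointwise_conv_macs(height, width, c_in, c_out):
    """Multiply-accumulates of a 1x1 convolution over a height x width map."""
    return height * width * c_in * c_out


if __name__ == "__main__":
    # Illustrative (height, width, channels) shapes, early to deep.
    for shape in ((112, 112, 16), (14, 14, 112), (7, 7, 160)):
        print(shape, "->", num_elements(*shape), "elements")

    # Last 1x1 convolution kept in SSDLite-MobileNetV2: 320 -> 1280.
    print("weights:", pointwise_conv_params(320, 1280))
    print("MACs at 7x7:", pointwise_conv_macs(7, 7, 320, 1280))
```

Each halving of the resolution divides the number of spatial positions by four, so even with several times more channels the deep tensors contain far fewer elements than the early ones, which is why an element-wise cost like h-swish is cheaper there.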


answered Oct 17 '22 23:10

netanel-sam
