I am working on a project to detect the following classes {cars, trucks, buses} and then extract the respective license plates.
This question is about the detection of those classes. I have used the traditional approach of HOG features with a linear SVM; it works, but with low accuracy. I am now looking into CNNs for deep-learning-based detection, which has shown higher accuracy. Approaches like R-CNN are extremely slow, and I completely understand how they work.
Recently the YOLO model has demonstrated very fast detection, which is quite interesting. If I guess correctly, YOLO is roughly similar to DPM.
Generally, YOLO has 24 convolutional layers and 2 fully connected layers. NVIDIA DIGITS implements a DetectNet based on this YOLO paper. What confuses me is that DetectNet by NVIDIA does not have any fully connected layers (Caffe model file). Instead, the output of the last convolutional layer is passed through dimension-reducing convolutional layers, which I think output some confidence of an object being present.
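For concreteness, this is roughly how I picture such a fully convolutional head (the filter counts, input resolution, and layer arrangement here are my own guesses for illustration, not DetectNet's actual configuration):
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, Input
from tensorflow.keras.models import Model
# a made-up fully convolutional detection head, for illustration only
inputs = Input([384, 1248, 3])
x = Conv2D(32, 3, strides=2, padding='same', activation='relu')(inputs)
x = Conv2D(64, 3, strides=2, padding='same', activation='relu')(x)
x = Conv2D(128, 3, strides=2, padding='same', activation='relu')(x)
# 1x1 "dimension-reducing" convolutions instead of fully connected layers:
# every spatial cell of the feature map gets its own prediction vector
coverage = Conv2D(1, 1, activation='sigmoid', name='coverage')(x)  # object confidence per grid cell
bbox = Conv2D(4, 1, name='bbox')(x)                                # box coordinates per grid cell
head = Model(inputs, [coverage, bbox])
head.summary()  # coverage: (None, 48, 156, 1), bbox: (None, 48, 156, 4)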
Question 1
I don't understand how convolutional layers replace FC layers and learn to predict the object. A detailed explanation of this would be very helpful.
The simple answer: yes, we don't need to use the Dense layer in TensorFlow or Keras. But what does that really mean, and how important is it? Let's look at some code that does MNIST classification without using the Dense layer.
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, MaxPool2D, InputLayer, Reshape
# get some image data for classification
(xtrain,ytrain),(xtest,ytest) = tf.keras.datasets.mnist.load_data()
xtrain = np.reshape(xtrain,[-1,28,28,1]) / 255.0
ytrain = np.eye(10)[ytrain]
xtest = np.reshape(xtest,[-1,28,28,1]) / 255.0
ytest = np.eye(10)[ytest]
# build a convolutional model without any dense or fully connected layers
model = tf.keras.models.Sequential([
InputLayer([28,28,1]),
Conv2D(filters=16, kernel_size=3, activation='tanh', padding='valid', kernel_initializer='he_normal'),
Conv2D(filters=16, kernel_size=3, activation='tanh', padding='valid', kernel_initializer='he_normal'),
MaxPool2D(pool_size=2),
Conv2D(filters=24, kernel_size=3, activation='tanh', padding='valid', kernel_initializer='he_normal'),
Conv2D(filters=24, kernel_size=3, activation='tanh', padding='valid', kernel_initializer='he_normal'),
MaxPool2D(pool_size=2),
Conv2D(filters=32, kernel_size=4, activation='tanh', padding='valid', kernel_initializer='he_normal'),
Conv2D(filters=10, kernel_size=1, activation='softmax', padding='valid', kernel_initializer='he_normal'),
Reshape([10])
])
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
_ = model.fit(x=xtrain,y=ytrain, validation_data=(xtest,ytest))
It classifies MNIST after 1 epoch with this result:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 26, 26, 16) 160
_________________________________________________________________
conv2d_1 (Conv2D) (None, 24, 24, 16) 2320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 12, 12, 16) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 10, 10, 24) 3480
_________________________________________________________________
conv2d_3 (Conv2D) (None, 8, 8, 24) 5208
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 4, 4, 24) 0
_________________________________________________________________
conv2d_4 (Conv2D) (None, 1, 1, 32) 12320
_________________________________________________________________
conv2d_5 (Conv2D) (None, 1, 1, 10) 330
_________________________________________________________________
reshape (Reshape) (None, 10) 0
=================================================================
Total params: 23,818
Trainable params: 23,818
Non-trainable params: 0
_________________________________________________________________
60000/60000 [==============================] - 28s 467us/sample - loss: 0.1709 - acc: 0.9543 - val_loss: 0.0553 - val_acc: 0.9838
The accuracy isn't state of the art, but it is certainly respectable for a single epoch. And as the model definition shows, not a single fully connected layer (tf.keras.layers.Dense) was used.
BUT, the layer conv2d_4, which is Conv2D(filters=32, kernel_size=4, ...), is effectively doing the same operation that Flatten() followed by Dense(32, ...) would do. Then conv2d_5, which is Conv2D(filters=10, kernel_size=1, ...), is effectively doing the same operation as Dense(10, ...). The key difference is that in the model above these operations are expressed in the convolution framework; under the covers, when the kernel_size equals the full height x width of the incoming feature map, the computation is identical to that of a fully connected layer.
Technically, then, the answer is that no Dense layer was used. In the spirit of acknowledging the underlying computation: yes, the final layers act like fully connected layers.
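A quick way to convince yourself of that equivalence is to copy a convolution kernel into a Dense layer and compare the two outputs on random data (a standalone sketch, separate from the model above):
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, Dense, Flatten
# random batch of 4x4x24 feature maps, like the input to conv2d_4
x = np.random.rand(5, 4, 4, 24).astype('float32')
conv = Conv2D(filters=32, kernel_size=4, padding='valid')
dense = Dense(32)
flat = Flatten()
conv_out = conv(x)          # shape (5, 1, 1, 32)
_ = dense(flat(x))          # call once so the dense layer builds its weights
# copy the conv kernel (4, 4, 24, 32) into the dense weight matrix (384, 32)
kernel, bias = conv.get_weights()
dense.set_weights([kernel.reshape(4 * 4 * 24, 32), bias])
dense_out = dense(flat(x))  # shape (5, 32)
# the "full-size" valid convolution and Flatten + Dense give the same numbers
print(np.allclose(conv_out.numpy().reshape(5, 32), dense_out.numpy(), atol=1e-5))  # True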