I want to implement Yolo v1 but I have some question regarding the algorithm.
I understand the fact that in YOLO, we divide the image per cell (7x7) and we predict a fixed number of bounding boxes (2 by default in the paper, with 4 coordinates : x, y , w, h), a confidence score and we predict also classes score for each cell. During the testing step, we can use NMS algorithm so as to remove multiple detection of an object.
1) When do we divide the image into a grid ? In fact, when I read the paper they mentionned to divide the image, but when I look at the architecture of the network it seems that we have two part : the convolutional layers and the FC layers. Does that mean the network do it "naturally" with the bounding boxes output ? The size of the grid 7x7 is it specific to the convolution part use it the paper ? If we use for example VGG does it change the size of the grid ?
EDIT : It seems that the grid is divided "virtually" thanks to the output for our network.
2) 2 bounding boxes are used for each cell. But in a cell, we can predict only one object. Why do we use two bounding boxes ?
At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
3) I do not really get this quote. In fact, it is said that there is one bounding boxe for each object in the image. But the bounding boxe is limited to the cell, so how does YOLO work when the object is bigger than one cell ?
4) Regarding the output layer, it is said they use a linear activation function, but does it use a max value equal to 1? because they said they normalize the coordinates between 0 and 1 (and i think it is the same for the confidence and class prediction).
1)The output of the final layer will be a vector of size SxSx(5B+C). That means that if you will take this vector and you will take first 5 values those will be x,y,w,h and confidence for the first box in the 1-st cell, then second five values will correspond to the second bounding box in the first cell, than you will have C values that correspond to class probabilities, let's say you have two classes and the following output of the network [0.21 0.98], so the second class has bigger probability and that means that network thinks it is second class in this grid cell. So yes, you are right image is divided virtually.
2)When they train the network they chose which predictor(read one box out of B boxes in certain grid cell) to penalize. They choose that one predictor by the highest IoU with ground truth. Quote from paper: "We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth." So let's say during prediction the first box has IoU of 0.3 and the second box has IoU of 0.7, we chose the second box to be responsible for predicting that object and we will accumulate the loss only from that box. So, for example, during training, the network will naturally learn to predict tall boxes (people) with first predictor and wide boxes with the second predictor (cars). So the reason to use multiple boxes is to be able to predict boxes with different aspect ratios.
3) "But the bounding box is limited to the cell, so how does YOLO work when the object is bigger than one cell ?". Bounding box predicted by the YOLO is not limited to the grid cell, only its (x,y) coordinates are limited to the grid cell. They write in the paper: "The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image.". So as you can see they predict width and height of a bbox with respect to the whole image, not grid cell.
4) Well, I don't know the answer to this question, but I can say that in their code they also use detection layer, that calculates loss, IoUs and a lot of other stuff. I'm not that good at reading their code, but you may have better luck: this is code for detection layer in yolo github
P.S. Another good source of information about YOLO: Joseph Redmon's presentation on youtube
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With