I have been studying RetinaNet recently. I read the original paper and some related ones, and wrote a post sharing what I have learned: http://blog.zenggyu.com/en/post/2018-12-05/retinanet-explained-and-demystified/. However, I still have a few points of confusion, which I also point out in the post. Can anyone please enlighten me?
Confusion #1
As indicated by the paper, an anchor box is assigned to background if its IoU with every ground-truth box is below 0.4. In this case, what should the corresponding classification target label be (assuming there are K classes)?
I know that SSD has a background class (which makes K+1 classes in total), while YOLO predicts a confidence score indicating whether the box contains an object (not background) in addition to the K class probabilities. While I didn't find any statement in the paper indicating that RetinaNet includes a background class, I did see this statement: "..., we only decode box predictions from ..., after thresholding detector confidence at 0.05", which seems to imply that there is a prediction for a confidence score. However, where does this score come from (since the classification subnet only outputs K numbers indicating the probabilities of the K classes)?
If RetinaNet defines target labels differently from SSD or YOLO, I would assume that the target for a background anchor is a length-K vector with all-zero entries and no 1s. However, in this case, how will the focal loss (see definition below) punish an anchor if it is a false negative?
$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$

where

$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}$$
Confusion #2
Unlike many other detectors, RetinaNet uses a class-agnostic bounding box regressor, and the last layer of its classification subnet uses a sigmoid activation. Does this mean that one anchor box can simultaneously predict multiple objects of different classes?
Confusion #3
Let's denote the matching pairs of anchor box and ground-truth box as $\{(A^i, G^i)\}_{i=1,\dots,N}$, where $A$ represents an anchor, $G$ represents a ground-truth, and $N$ is the number of matches.
For each matching anchor, the regression subnet predicts four numbers, which we denote as $P^i = (P^i_x, P^i_y, P^i_w, P^i_h)$. The first two numbers specify the offset between the centers of anchor $A^i$ and ground-truth $G^i$, while the last two numbers specify the offset between the width/height of the anchor and the ground-truth. Correspondingly, for each of these predictions, there is a regression target $T^i$ computed as the offset between the anchor and the ground-truth:

$$T^i_x = \frac{G^i_x - A^i_x}{A^i_x}, \qquad T^i_y = \frac{G^i_y - A^i_y}{A^i_y}$$

$$T^i_w = \log\left(\frac{G^i_w}{A^i_w}\right), \qquad T^i_h = \log\left(\frac{G^i_h}{A^i_h}\right)$$
Are the above equations correct?
Many thanks in advance and feel free to point out any other misunderstandings in the post!
Update:
For future reference, here is another confusion I had when I was studying RetinaNet (I found the answer in a conversation on Slack):
I am one of the authors of the open-source RetinaNet project fizyr/keras-retinanet. I'll try to answer your questions as best as I can.
Confusion #1
In general, there are two frequently used methods for producing classification scores in object detectors: either you use softmax or you use sigmoid.
If you use softmax, your target values should always be one-hot vectors, meaning if there is no object you should "classify" it as background (meaning you need a background class). The benefit is that your class scores always sum up to one.
If you use sigmoid, there are fewer constraints. This has two benefits in my opinion: you don't need a background class (which makes the implementation cleaner), and it allows the network to do multi-label classification (although our implementation doesn't support it, it is theoretically possible). A small additional benefit is that your network is slightly smaller, since it classifies one class fewer compared to softmax, though this is probably negligible.
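The difference between the two target conventions can be sketched as follows (a minimal illustration with made-up class counts, not the actual keras-retinanet code):

```python
import numpy as np

K = 3  # number of foreground classes (illustrative)

# Softmax-style targets: K + 1 classes, with an explicit background class.
softmax_background = np.eye(K + 1)[K]  # [0, 0, 0, 1] -> "background" one-hot
softmax_positive = np.eye(K + 1)[1]    # [0, 1, 0, 0] -> class 1

# Sigmoid-style targets: K independent binary labels, no background class.
sigmoid_background = np.zeros(K)       # [0, 0, 0] -> no class is present
sigmoid_positive = np.zeros(K)
sigmoid_positive[1] = 1.0              # [0, 1, 0] -> class 1 is present
# Multi-label is possible in principle: just set several entries to 1.
```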
In the early days of implementing RetinaNet we used softmax, because of legacy code from py-faster-rcnn. I contacted the author of the focal loss paper and asked him about the softmax/sigmoid situation. His answer was that it is a matter of personal preference and it doesn't matter much which one you use. Because of the benefits mentioned above, sigmoid is now my personal preference as well.
However, where does this score come from (since the classification subnet only outputs K numbers indicating the probability of K classes)?
Every class score is treated as its own object, but for one anchor they all share the same regression values. If a class score is above the threshold (which I'm pretty sure is chosen arbitrarily), it is considered a candidate object.
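The decoding step described above can be sketched like this (a hypothetical helper, not the actual keras-retinanet code; the 0.05 threshold follows the paper):

```python
import numpy as np

def candidate_detections(class_scores, threshold=0.05):
    """Return (anchor_index, class_index, score) for every anchor/class pair
    whose independent sigmoid score exceeds the threshold. Each pair counts
    as its own candidate object; all classes of one anchor share one box."""
    anchor_idx, class_idx = np.where(class_scores > threshold)
    return [(a, c, class_scores[a, c]) for a, c in zip(anchor_idx, class_idx)]

# Toy scores for 2 anchors and 2 classes:
scores = np.array([[0.01, 0.90],   # anchor 0: class 1 is a candidate
                   [0.06, 0.02]])  # anchor 1: class 0 barely passes
detections = candidate_detections(scores)
```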
If RetinaNet defines target labels differently from SSD or YOLO, I would assume that the target for a background anchor is a length-K vector with all-zero entries and no 1s. However, in this case, how will the focal loss (see definition below) punish an anchor if it is a false negative?
Negatives are classified as a vector containing only zeros; positives are classified by a one-hot vector. Assuming the prediction is a vector of all zeros but the target was a one-hot vector (in other words, a false negative), then $p_t$ in your formula is a vector of zeros, and the focal loss will evaluate to a large value for that anchor.
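This can be verified numerically with a minimal sketch of the per-class focal loss term (using the paper's default $\alpha = 0.25$, $\gamma = 2$):

```python
import numpy as np

def focal_loss_term(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one class: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t),
    where p_t = p if y == 1 else 1 - p."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# False negative: the target says the object is present (y = 1) but the
# network predicts nearly zero, so p_t is tiny and the loss is large.
false_negative_loss = focal_loss_term(p=0.01, y=1)

# Easy true negative: p_t is close to 1, so the loss is heavily down-weighted.
true_negative_loss = focal_loss_term(p=0.01, y=0)
```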
Confusion #2
Short answer: yes.
Confusion #3
With respect to the original implementation, it is almost correct. All values are divided by the anchor's width $A_w$ or height $A_h$; dividing by $A_x$ and $A_y$ for the values of $T_x$ and $T_y$ is incorrect.
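With that correction, the per-pair targets would be computed as below (a sketch only; actual implementations often additionally normalize these targets, which is omitted here):

```python
import numpy as np

def regression_targets(anchor, gt):
    """Compute (T_x, T_y, T_w, T_h) for one matched (anchor, ground-truth)
    pair, with boxes given as (center_x, center_y, width, height).
    Note that the center offsets are divided by the anchor's width and
    height, not by A_x and A_y."""
    A_x, A_y, A_w, A_h = anchor
    G_x, G_y, G_w, G_h = gt
    return np.array([(G_x - A_x) / A_w,
                     (G_y - A_y) / A_h,
                     np.log(G_w / A_w),
                     np.log(G_h / A_h)])

# Ground truth shifted right by 2 and twice as wide as the anchor:
targets = regression_targets((10.0, 10.0, 4.0, 4.0), (12.0, 10.0, 8.0, 4.0))
# -> [0.5, 0.0, log(2), 0.0]
```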
That said, a while back we switched to a slightly simpler parameterization where the regression is instead computed as the difference between the top-left and bottom-right points (as a fraction of the anchor's width and height). This simplified the implementation a bit, since we use top-left / bottom-right coordinates throughout the code. In addition, I noticed our results on COCO improved slightly.
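A sketch of that corner-based parameterization, assuming boxes given as (x1, y1, x2, y2) top-left/bottom-right coordinates (this mirrors the description above, not necessarily the exact keras-retinanet code):

```python
import numpy as np

def corner_regression_targets(anchor, gt):
    """Offsets of the top-left (x1, y1) and bottom-right (x2, y2) corners,
    each expressed as a fraction of the anchor's width or height."""
    a = np.asarray(anchor, dtype=float)
    g = np.asarray(gt, dtype=float)
    A_w = a[2] - a[0]
    A_h = a[3] - a[1]
    return (g - a) / np.array([A_w, A_h, A_w, A_h])

# Ground truth whose right edge extends 4 px past a 4x4 anchor:
corner_targets = corner_regression_targets((8.0, 8.0, 12.0, 12.0),
                                           (10.0, 8.0, 16.0, 12.0))
# -> [0.5, 0.0, 1.0, 0.0]
```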