I am reading the Faster R-CNN code in the TensorFlow models repository, and I am confused about the use of tf.stop_gradient.
Consider the following code snippet:
if self._is_training:
    # Treat the RPN proposals as constants for the second stage.
    proposal_boxes = tf.stop_gradient(proposal_boxes)
    if not self._hard_example_miner:
        (groundtruth_boxlists, groundtruth_classes_with_background_list, _,
         groundtruth_weights_list
        ) = self._format_groundtruth_data(true_image_shapes)
        (proposal_boxes, proposal_scores,
         num_proposals) = self._sample_box_classifier_batch(
             proposal_boxes, proposal_scores, num_proposals,
             groundtruth_boxlists, groundtruth_classes_with_background_list,
             groundtruth_weights_list)
More code is here. My question is: what happens if tf.stop_gradient is not set for proposal_boxes?
tf.gradients(loss, embed) computes the partial derivative of the tensor loss with respect to the tensor embed. TensorFlow computes this partial derivative by backpropagation, so it is expected behavior that evaluating the result of tf.gradients(...) performs backpropagation. However, evaluating that tensor does not perform any variable updates, because the expression does not include any assignment operations.
tf.stop_gradient() is an operation that acts as the identity function in the forward direction but stops the accumulated gradient from flowing through that operator in the backward direction.
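As a minimal sketch of both points, assuming TensorFlow 2.x in eager mode and using tf.GradientTape in place of tf.gradients (the variable names are illustrative and do not come from the API code), something like this shows the forward value passing through unchanged while the backward path is cut:

```python
import tensorflow as tf

x = tf.Variable(3.0)

# persistent=True lets us ask the same tape for more than one gradient.
with tf.GradientTape(persistent=True) as tape:
    y = x * x                            # forward value: 9.0
    y_blocked = tf.stop_gradient(y)      # same forward value, no gradient path back to x
    loss_open = 2.0 * y                  # depends on x through y
    loss_blocked = 2.0 * y_blocked + x   # depends on x only through the "+ x" term

grad_open = tape.gradient(loss_open, x)        # d(2x^2)/dx = 4x = 12.0
grad_blocked = tape.gradient(loss_blocked, x)  # only the "+ x" term contributes: 1.0
del tape

print(float(y), float(y_blocked))             # 9.0 9.0
print(float(grad_open), float(grad_blocked))  # 12.0 1.0
print(float(x))                               # 3.0, computing gradients did not update x
```

Computing the gradients leaves x untouched; the variable would only change if an optimizer or an explicit assign applied an update.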
For comparison, in PyTorch, disabling gradient calculation (torch.no_grad) is useful for inference, when you are sure that you will not call Tensor.backward(). It reduces memory consumption for computations that would otherwise have requires_grad=True.
This is a really good question, because this single tf.stop_gradient line is crucial when training Faster R-CNN models. Here is why it is needed during training.
Faster R-CNN models are two-stage detectors, and the loss function has to serve the goals of both stages: the RPN loss and the Fast R-CNN loss both need to be minimized.
Here is what the paper says in section 3.2:
"Both RPN and Fast R-CNN, trained independently, will modify their convolutional layers in different ways. We therefore need to develop a technique that allows for sharing convolutional layers between the two networks, rather than learning two separate networks."
The paper then describes three training schemes. The original paper adopted the first one, alternating training: train the RPN first, then train Fast R-CNN.
The second scheme is approximate joint training. It is easy to implement, and it is the scheme adopted by the API. Fast R-CNN takes the coordinates of the bounding boxes predicted by the RPN as input, so the Fast R-CNN loss has gradients w.r.t. the bounding box coordinates. In this training scheme those gradients are ignored, which is exactly why tf.stop_gradient is used. The paper reports that this scheme reduces training time by about 25-50% compared with alternating training.
The third scheme is non-approximate joint training, in which no tf.stop_gradient is needed. The paper notes that this requires an RoI pooling layer that is differentiable w.r.t. the box coordinates, which it calls a nontrivial problem.
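To make the difference between the second and third schemes concrete, here is a toy sketch, assuming TensorFlow 2.x in eager mode. The two scalar weights, the losses and every name in it are hypothetical stand-ins for the two stages, not the API's actual model or loss functions:

```python
import tensorflow as tf

# Hypothetical scalar stand-ins for the two stages; none of these names come from the API.
rpn_weight = tf.Variable(2.0)         # first-stage (RPN) parameter
classifier_weight = tf.Variable(0.5)  # second-stage (box classifier) parameter
feature = tf.constant(3.0)
groundtruth_box = tf.constant(5.0)
groundtruth_score = tf.constant(4.0)


def total_loss_gradients(block_box_gradient):
    """Returns [d(total)/d(rpn_weight), d(total)/d(classifier_weight)] as floats."""
    with tf.GradientTape() as tape:
        proposal_box = rpn_weight * feature                     # 6.0
        rpn_loss = tf.square(proposal_box - groundtruth_box)    # 1.0
        if block_box_gradient:
            # Approximate joint training: the second stage sees the box as a constant.
            second_stage_box = tf.stop_gradient(proposal_box)
        else:
            # Non-approximate joint training: the second-stage loss also reaches the RPN.
            second_stage_box = proposal_box
        score = classifier_weight * second_stage_box            # 3.0
        classifier_loss = tf.square(score - groundtruth_score)  # 1.0
        total_loss = rpn_loss + classifier_loss
    grads = tape.gradient(total_loss, [rpn_weight, classifier_weight])
    return [float(g) for g in grads]


approx_grads = total_loss_gradients(block_box_gradient=True)
full_grads = total_loss_gradients(block_box_gradient=False)
print(approx_grads)  # [6.0, -12.0]
print(full_grads)    # [3.0, -12.0]
```

In both cases the classifier weight receives the same gradient. The only difference is whether the second-stage loss also pulls on the RPN weight through the box coordinate, which is what removing tf.stop_gradient would allow.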
But why are those gradients ignored?
It turns out that the RoI pooling layer is fully differentiable, but the main reason to favor scheme two is that scheme three makes training unstable in its early stages.
One of the authors of the API had a really good answer here
Some further reading regarding approximate joint training.