I'm learning regularization in neural networks from the deeplearning.ai
course. In the dropout regularization lecture, the professor says that if dropout is applied, the calculated activation values will be smaller than when dropout is not applied (at test time). So we need to scale the activations during training in order to keep the testing phase simpler.
I understand this fact, but I don't understand how the scaling is done. Here is a code sample used to implement inverted dropout.
import numpy as np

keep_prob = 0.8  # 0 <= keep_prob <= 1
# this code is only for layer 3
# entries whose random value is below 0.8 become True (kept): ~80% stay, ~20% are dropped
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3)  # zero out the dropped activations
# scale a3 up so the expected value of the output is not reduced
# (ensures that the expected value of a3 remains the same) - to solve the scaling problem
a3 = a3 / keep_prob
In the above code, why are the activations divided by 0.8, i.e. the probability of keeping a node in the layer (keep_prob)? Any numerical example will help.
I figured out the answer myself after spending some time understanding inverted dropout. Here is the intuition:
We keep the neurons in any layer with probability keep_prob. Let's say keep_prob = 0.6. This means shutting down 40% of the neurons in the layer. If the output of the layer before shutting down 40% of the neurons was x, then after applying 40% dropout its expected value is reduced by 0.4 * x, so it becomes x - 0.4x = 0.6x.
To maintain the original output (expected value), we need to divide the output by keep_prob (0.6 here).
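As a quick numerical check (a minimal, self-contained sketch; the toy activation values below are made up for illustration), dropping ~40% of a vector of activations and then dividing by keep_prob brings the mean back to roughly its original value:
import numpy as np

np.random.seed(0)
keep_prob = 0.6
a = np.ones((1000, 1))                    # toy activations, all 1.0, so the mean is exactly 1.0
d = np.random.rand(*a.shape) < keep_prob  # ~60% of the mask entries are True (kept)
dropped = a * d                           # mean drops to roughly 0.6
scaled = dropped / keep_prob              # mean is back to roughly 1.0
print(a.mean(), dropped.mean(), scaled.mean())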
Another way of looking at it could be:
TL;DR: Even though dropout leaves us with fewer active neurons, we want those neurons to contribute the same amount to the output as when all the neurons were present.
With dropout = 0.20, we're "shutting down 20% of the neurons", which is the same as "keeping 80% of the neurons."
Say the output of the layer is x. "Keeping 80%" is concretely 0.8 * x. Dividing that by keep_prob "scales it back" to the original value, (0.8 * x) / 0.8 = x:
x = 0.8 * x # x is 80% of what it used to be
x = x/0.8 # x is scaled back up to its original value
Now, the purpose of the inverting is to ensure that the value of z4 is not impacted by the reduction of a3 (Coursera).
When we drop part of a3, we're inadvertently also scaling down the value of z4 (since z4 = W4 * a3 + b4). To compensate for this scaling, we need to divide a3 by keep_prob to scale it back up. (Stack Overflow)
# keep 80% of the neurons
keep_prob = 0.8
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3)
# Scale it back up
a3 = a3 / keep_prob
# this way the expected value of z4 is not affected
z4 = np.dot(W4, a3) + b4  # matrix product, not element-wise
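This is also why scaling during training "keeps the testing phase simpler", as mentioned at the top of the question: because a3 is already rescaled while training, the test-time forward pass for this layer needs no dropout mask and no extra scaling. A minimal sketch, assuming W4, a3, b4 and keep_prob are already defined as above:
# training-time forward pass (inverted dropout)
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3_train = np.multiply(a3, d3) / keep_prob  # drop and immediately rescale
z4_train = np.dot(W4, a3_train) + b4
# test-time forward pass: no mask, no scaling -- just the plain layer
z4_test = np.dot(W4, a3) + b4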
What happens if you don't scale?
With scaling:
-------------
Cost after iteration 0: 0.6543912405149825
Cost after iteration 10000: 0.061016986574905605
Cost after iteration 20000: 0.060582435798513114
On the train set:
Accuracy: 0.9289099526066351
On the test set:
Accuracy: 0.95
Without scaling:
-------------
Cost after iteration 0: 0.6634619861891963
Cost after iteration 10000: 0.05040089794130624
Cost after iteration 20000: 0.049722351029060516
On the train set:
Accuracy: 0.933649289099526
On the test set:
Accuracy: 0.95
Though this is just a single example on one dataset, I'm not sure it makes a major difference in shallow neural networks; perhaps it matters more for deeper architectures.
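For anyone who wants to reproduce this kind of comparison, the only difference between the two runs is whether the division by keep_prob is applied in the dropout layer during training. A hypothetical sketch of that toggle (the function name dropout_forward and the scale flag are my own, not from the course code):
import numpy as np

def dropout_forward(a, keep_prob=0.8, scale=True):
    # scale=True  -> inverted dropout: divide by keep_prob (expected value preserved)
    # scale=False -> plain dropout with no rescaling (the "without scaling" run)
    mask = np.random.rand(*a.shape) < keep_prob
    a = a * mask
    if scale:
        a = a / keep_prob
    return a, mask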