 

How does inverting the dropout compensate for the effect of dropout and keep the expected values unchanged?

I'm learning regularization in neural networks from the deeplearning.ai course. In the dropout regularization lecture, the professor says that if dropout is applied, the computed activation values will be smaller than when dropout is not applied (i.e., at test time). So we need to scale the activations during training in order to keep the testing phase simpler.

I understood this fact, but I don't understand how the scaling is done. Here is a code sample that is used to implement inverted dropout.

import numpy as np

keep_prob = 0.8   # 0 <= keep_prob <= 1
# this code is only for layer 3 (a3 is the activation of layer 3)
# entries of d3 are 1 where the random number is < keep_prob,
# so ~80% of the neurons are kept and ~20% are dropped
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob

a3 = np.multiply(a3, d3)   # zero out the dropped activations

# scale a3 back up so that its expected value is unchanged
# (this is the "inverted" part of inverted dropout)
a3 = a3 / keep_prob

In the above code, why are the activations divided by 0.8, i.e. by the probability of keeping a node in the layer (keep_prob)? Any numerical example would help.

asked Jul 25 '19 by Kaushal28

2 Answers

I figured out the answer myself after spending some time understanding inverted dropout. Here is the intuition:

We keep each neuron in a layer with probability keep_prob. Say keep_prob = 0.6, i.e. we shut down 40% of the neurons in the layer. If the output of the layer before dropout was x, then after dropping 40% of the neurons it is reduced, in expectation, by 0.4 * x, so the expected output becomes x - 0.4x = 0.6x.

To restore the original expected value, we divide this reduced output by keep_prob (0.6 here): 0.6x / 0.6 = x.
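
Here's a quick numerical sketch of the same idea in NumPy (an illustration only; the activation value 10 and the shapes are made up, not from the course):

import numpy as np

np.random.seed(0)
keep_prob = 0.6
a = np.full((5, 1000), 10.0)               # pretend every activation equals 10

d = np.random.rand(*a.shape) < keep_prob   # keep ~60% of the units
dropped = a * d                            # dropped units become 0
inverted = dropped / keep_prob             # inverted dropout scaling

print(a.mean())          # 10.0  -> original value
print(dropped.mean())    # ~6.0  -> reduced to roughly keep_prob * 10
print(inverted.mean())   # ~10.0 -> expected value restored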

answered Sep 28 '22 by Kaushal28


Another way of looking at it could be:

TL;DR: Even though dropout leaves us with fewer active neurons, we want those neurons to contribute, in expectation, the same amount to the output as when all the neurons were present.

With dropout = 0.20, we're "shutting down 20% of the neurons", which is the same as "keeping 80% of the neurons".

Say the total output of the neurons is x. "Keeping 80%" concretely leaves 0.8 * x. Dividing that by keep_prob scales it back up to the original value: (0.8 * x) / 0.8 = x:

x = 0.8 * x # x is 80% of what it used to be
x = x/0.8   # x is scaled back up to its original value

Now, the purpose of the inverting is to ensure that the value of Z will not be impacted by this reduction of the activations (Coursera).

When we shut down some of the units in a3, we're inadvertently also scaling down the expected value of z4 (since z4 = W4 · a3 + b4). To compensate for this, we divide a3 by keep_prob to scale it back up. (Stack Overflow)

# keep ~80% of the neurons in layer 3
keep_prob = 0.8
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3)   # zero out the dropped activations

# Scale it back up
a3 = a3 / keep_prob

# this way the expected value of z4 is not affected
z4 = np.dot(W4, a3) + b4
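
As a rough sanity check (toy shapes and random values of my own, not the actual course assignment), you can average z4 over many dropout masks and see that with inverted dropout its mean stays close to the no-dropout z4:

import numpy as np

np.random.seed(1)
W4 = np.random.randn(4, 3)
b4 = np.random.randn(4, 1)
a3_full = np.random.randn(3, 1)

z4_no_dropout = np.dot(W4, a3_full) + b4

keep_prob = 0.8
samples = []
for _ in range(100000):
    d3 = np.random.rand(*a3_full.shape) < keep_prob
    a3 = np.multiply(a3_full, d3) / keep_prob   # inverted dropout
    samples.append(np.dot(W4, a3) + b4)

print(z4_no_dropout.ravel())                # z4 without any dropout
print(np.mean(samples, axis=0).ravel())     # average z4 over dropout masks, close to the above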

What happens if you don't scale?

With scaling:
-------------
Cost after iteration 0: 0.6543912405149825
Cost after iteration 10000: 0.061016986574905605
Cost after iteration 20000: 0.060582435798513114

On the train set:
Accuracy: 0.9289099526066351
On the test set:
Accuracy: 0.95


Without scaling:
-------------
Cost after iteration 0: 0.6634619861891963
Cost after iteration 10000: 0.05040089794130624
Cost after iteration 20000: 0.049722351029060516

On the train set:
Accuracy: 0.933649289099526
On the test set:
Accuracy: 0.95

Though this is just a single example with one dataset, I'm not sure if it makes a major difference in shallow neural networks. Perhaps it pertains more to deeper architectures.

answered Sep 28 '22 by Jacob