I have recently been reading the WaveNet and PixelCNN papers, and in both of them the authors mention that using gated activation functions works better than a ReLU. But neither paper offers an explanation as to why that is.
I have asked on other platforms (like on r/machinelearning) but I have not gotten any replies so far. Might it be that they just tried (by chance) this replacement and it turned out to yield favorable results?
Function for reference:

y = tanh(W_{k,f} ∗ x) ⊙ σ(W_{k,g} ∗ x)

i.e. the element-wise product of the tanh and the sigmoid of the two convolutions.
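For concreteness, here is a minimal PyTorch sketch of that gated unit as I read the formula; the class name, channel count and kernel size are my own choices, and WaveNet's dilated, causal convolutions are omitted:

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    """Sketch of y = tanh(W_{k,f} * x) ⊙ σ(W_{k,g} * x)."""
    def __init__(self, channels, kernel_size=2):
        super().__init__()
        # W_{k,f}: "filter" convolution, W_{k,g}: "gate" convolution
        self.conv_f = nn.Conv1d(channels, channels, kernel_size)
        self.conv_g = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):
        # element-wise product of the two branches
        return torch.tanh(self.conv_f(x)) * torch.sigmoid(self.conv_g(x))

x = torch.randn(1, 16, 100)   # (batch, channels, time)
y = GatedActivation(16)(x)    # same channels, slightly shorter time axis
```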
Efficiency: ReLU is faster to compute than the sigmoid function, and so is its derivative. This makes a significant difference to training and inference time in neural networks: only a constant factor, but constants can matter.
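A rough NumPy illustration of that constant-factor point (array size chosen arbitrarily):

```python
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)

# ReLU is a single comparison per element...
relu = np.maximum(x, 0.0)

# ...while the sigmoid needs an exponential per element, which is
# noticeably more expensive (still only a constant factor).
sigmoid = 1.0 / (1.0 + np.exp(-x))

# Derivatives follow the same pattern: ReLU's is a 0/1 indicator,
# the sigmoid's needs the sigmoid itself.
relu_grad = (x > 0).astype(np.float32)
sigmoid_grad = sigmoid * (1.0 - sigmoid)
```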
The ReLU activation function is widely used and is the default choice, as it generally yields good results. If we encounter dead neurons in our network, the leaky ReLU is a better choice. ReLU should only be used in the hidden layers.
ReLU (Rectified Linear Unit) activation function: ReLU is currently the most widely used activation function, appearing in almost all convolutional neural networks and deep learning models.
ReLU stands for Rectified Linear Unit. The main advantage of the ReLU function over other activation functions is that it does not activate all of the neurons at the same time: any neuron with a negative pre-activation outputs exactly zero.
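A tiny illustration of that last point, with randomly chosen numbers:

```python
import numpy as np

# ReLU zeroes out every negative pre-activation, so roughly half of
# randomly initialised units output exactly 0; sigmoid or tanh would
# keep all of them non-zero.
pre_activations = np.random.randn(8)
print(np.maximum(pre_activations, 0.0))
```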
I did some digging and talked some more with a friend, who pointed me towards a paper by Dauphin et al., "Language Modeling with Gated Convolutional Networks". Section 3 of that paper offers a good explanation of this topic:
LSTMs enable long-term memory via a separate cell controlled by input and forget gates. This allows information to flow unimpeded through potentially many timesteps. Without these gates, information could easily vanish through the transformations of each timestep.
In contrast, convolutional networks do not suffer from the same kind of vanishing gradient and we find experimentally that they do not require forget gates. Therefore, we consider models possessing solely output gates, which allow the network to control what information should be propagated through the hierarchy of layers.
In other words, they adopted the concept of gates from LSTMs and applied it to stacked convolutional layers, to control what kind of information is let through, and apparently this works better than using a ReLU.
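Here is a minimal PyTorch sketch of such an output-gated convolutional layer, as I understand section 3 of the Dauphin et al. paper (the gated linear unit): the second convolution acts purely as an output gate on the first. The class name, channel count and kernel size are my own.

```python
import torch
import torch.nn as nn

class GatedConvLayer(nn.Module):
    """Sketch of a GLU-style layer: h(x) = (W * x) ⊙ σ(V * x)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.gate = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # The sigmoid output gate decides how much of each unit's
        # activation is propagated up the hierarchy of layers.
        return self.conv(x) * torch.sigmoid(self.gate(x))

x = torch.randn(1, 32, 50)    # (batch, channels, sequence length)
h = GatedConvLayer(32)(x)     # same shape as the input
```

As far as I know, PyTorch also ships this operation as torch.nn.functional.glu, which splits one tensor in half along a dimension and gates one half with the sigmoid of the other.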
Edit: But WHY it works better I still don't know. If anyone could give me an even remotely intuitive answer I would be grateful; I looked around a bit more, and apparently we are still basing our judgement on trial and error.