Why does the gated activation function (used in Wavenet) work better than a ReLU?

I have recently been reading the Wavenet and PixelCNN papers, and both of them mention that using gated activation functions works better than a ReLU. But in neither case do they offer an explanation as to why that is.

I have asked on other platforms (such as r/machinelearning) but have not gotten any replies so far. Might it be that they simply tried this replacement by chance and it turned out to yield favorable results?

Function for reference: y = tanh(W_{k,f} ∗ x) ⊙ σ(W_{k,g} ∗ x)

That is, the element-wise multiplication between the tanh and the sigmoid of the two convolutions.
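For anyone who wants to see it spelled out, here is a minimal sketch of that gated activation in PyTorch. The module and variable names are my own, and I leave out the causal padding and residual/skip connections the full Wavenet layer uses:

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    """tanh(W_{k,f} * x) ⊙ σ(W_{k,g} * x) for one (dilated) conv layer k."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        # W_{k,f}: "filter" convolution, W_{k,g}: "gate" convolution
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):
        # element-wise product of the tanh "content" and the sigmoid "gate"
        return torch.tanh(self.filter_conv(x)) * torch.sigmoid(self.gate_conv(x))

# usage: a batch of 8 sequences, 16 channels, 100 timesteps
y = GatedActivation(16)(torch.randn(8, 16, 100))
print(y.shape)  # torch.Size([8, 16, 99]) – one timestep lost to the size-2 kernel
```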

Ahmad Moussa, asked May 09 '19



1 Answer

I did some digging and talked some more with a friend, who pointed me towards the paper by Dauphin et al., "Language Modeling with Gated Convolutional Networks". The authors offer a good explanation of this topic in section 3 of the paper:

LSTMs enable long-term memory via a separate cell controlled by input and forget gates. This allows information to flow unimpeded through potentially many timesteps. Without these gates, information could easily vanish through the transformations of each timestep.

In contrast, convolutional networks do not suffer from the same kind of vanishing gradient and we find experimentally that they do not require forget gates. Therefore, we consider models possessing solely output gates, which allow the network to control what information should be propagated through the hierarchy of layers.

In other words, they adopted the concept of gates and applied it to sequential convolutional layers, to control what type of information is let through, and apparently this works better than using a ReLU.
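To make the comparison concrete, here is a rough sketch (again in PyTorch, names my own) of a gated conv block in the spirit of Dauphin et al., next to the plain ReLU block it would replace. As I understand it, their GLU keeps the sigmoid gate but uses a linear path instead of the tanh used in Wavenet:

```python
import torch
import torch.nn as nn

class GLUConv(nn.Module):
    """One gated conv block in the spirit of Dauphin et al.: h = A ⊙ σ(B)."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        # a single convolution produces both the "content" A and the "gate" B
        self.conv = nn.Conv1d(in_channels, 2 * out_channels, kernel_size)

    def forward(self, x):
        a, b = self.conv(x).chunk(2, dim=1)
        # the sigmoid gate decides, per channel and timestep, what gets through
        return a * torch.sigmoid(b)

# the plain ReLU block it would replace, for comparison
relu_block = nn.Sequential(nn.Conv1d(16, 32, 3), nn.ReLU())
gated_block = GLUConv(16, 32)

x = torch.randn(8, 16, 100)
print(relu_block(x).shape, gated_block(x).shape)  # both torch.Size([8, 32, 98])
```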

edit: But WHY it works better, I still don't know. If anyone could give me an even remotely intuitive answer I would be grateful. I looked around a bit more, and apparently we are still basing our judgement on trial and error.

Ahmad Moussa, answered Oct 13 '22