I have recently been reading the WaveNet and PixelCNN papers, and in both of them the authors mention that using gated activation functions works better than a ReLU. But neither paper offers an explanation as to why that is.
I have asked on other platforms (like on r/machinelearning) but I have not gotten any replies so far. Might it be that they just tried (by chance) this replacement and it turned out to yield favorable results?
Function for reference:

y = tanh(W_{k,f} ∗ x) ⊙ σ(W_{k,g} ∗ x)

i.e. the element-wise product of the tanh and the sigmoid of the two convolutions.
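For concreteness, here is a minimal PyTorch sketch of that gated unit as I read the formula; the class name, channel count and kernel size are my own choices, and WaveNet's dilated, causal convolutions are omitted:

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    """Sketch of y = tanh(W_{k,f} * x) ⊙ σ(W_{k,g} * x)."""
    def __init__(self, channels, kernel_size=2):
        super().__init__()
        # W_{k,f}: "filter" convolution, W_{k,g}: "gate" convolution
        self.conv_f = nn.Conv1d(channels, channels, kernel_size)
        self.conv_g = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):
        # element-wise product of the two branches
        return torch.tanh(self.conv_f(x)) * torch.sigmoid(self.conv_g(x))

x = torch.randn(1, 16, 100)   # (batch, channels, time)
y = GatedActivation(16)(x)    # same channels, slightly shorter time axis
```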
Efficiency: ReLU is faster to compute than the sigmoid function, and so is its derivative. This makes a significant difference to training and inference time in neural networks: only a constant factor, but constants can matter.
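A rough NumPy illustration of that constant-factor point (array size chosen arbitrarily):

```python
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)

# ReLU is a single comparison per element...
relu = np.maximum(x, 0.0)

# ...while the sigmoid needs an exponential per element, which is
# noticeably more expensive (still only a constant factor).
sigmoid = 1.0 / (1.0 + np.exp(-x))

# Derivatives follow the same pattern: ReLU's is a 0/1 indicator,
# the sigmoid's needs the sigmoid itself.
relu_grad = (x > 0).astype(np.float32)
sigmoid_grad = sigmoid * (1.0 - sigmoid)
```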
The ReLU activation function is widely used and is the default choice, as it generally yields good results. If we encounter dead neurons in our network, the leaky ReLU is a better choice. ReLU should only be used in the hidden layers.
ReLU (Rectified Linear Unit) activation function: ReLU is currently the most widely used activation function, appearing in almost all convolutional neural networks and deep learning models.
ReLU stands for Rectified Linear Unit. The main advantage of the ReLU function over other activation functions is that it does not activate all of the neurons at the same time: any neuron with a negative pre-activation outputs exactly zero.
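A tiny illustration of that last point, with randomly chosen numbers:

```python
import numpy as np

# ReLU zeroes out every negative pre-activation, so roughly half of
# randomly initialised units output exactly 0; sigmoid or tanh would
# keep all of them non-zero.
pre_activations = np.random.randn(8)
print(np.maximum(pre_activations, 0.0))
```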
I did some digging and talked some more with a friend, who pointed me towards a paper by Dauphin et al., "Language Modeling with Gated Convolutional Networks". Section 3 of that paper offers a good explanation of this topic:
LSTMs enable long-term memory via a separate cell controlled by input and forget gates. This allows information to flow unimpeded through potentially many timesteps. Without these gates, information could easily vanish through the transformations of each timestep.
In contrast, convolutional networks do not suffer from the same kind of vanishing gradient and we find experimentally that they do not require forget gates. Therefore, we consider models possessing solely output gates, which allow the network to control what information should be propagated through the hierarchy of layers.
In other words, they adopted the concept of gates from LSTMs and applied it to stacked convolutional layers, to control what kind of information is let through, and apparently this works better than using a ReLU.
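Here is a minimal PyTorch sketch of such an output-gated convolutional layer, as I understand section 3 of the Dauphin et al. paper (the gated linear unit): the second convolution acts purely as an output gate on the first. The class name, channel count and kernel size are my own.

```python
import torch
import torch.nn as nn

class GatedConvLayer(nn.Module):
    """Sketch of a GLU-style layer: h(x) = (W * x) ⊙ σ(V * x)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.gate = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # The sigmoid output gate decides how much of each unit's
        # activation is propagated up the hierarchy of layers.
        return self.conv(x) * torch.sigmoid(self.gate(x))

x = torch.randn(1, 32, 50)    # (batch, channels, sequence length)
h = GatedConvLayer(32)(x)     # same shape as the input
```

As far as I know, PyTorch also ships this operation as torch.nn.functional.glu, which splits one tensor in half along a dimension and gates one half with the sigmoid of the other.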
Edit: But WHY it works better I still don't know. If anyone could give me an even remotely intuitive answer I would be grateful; I looked around a bit more, and apparently we are still basing our judgement on trial and error.