 

Why the 6 in relu6?

Tags:

tensorflow

I've hacked together a deep feed-forward NN from scratch in R, and it seems more stable with "hard sigmoid" activations - max(0, min(1, x)) - than with ReLU. While trying to port it to TensorFlow, I noticed that this activation function isn't built in, only relu6, which uses an upper cutoff at 6. Is there a reason for this? (I realize that you could do relu6(x*6)/6, but if the TF guys put the 6 there for a good reason, I'd like to know.) Also, I'd like to know whether others have explosion problems with ReLU in feed-forward nets (I'm aware of the RNN issues).
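For reference, here is a minimal sketch (assuming TensorFlow 2 with eager execution) of the workaround mentioned above: building the hard sigmoid max(0, min(1, x)) out of the built-in tf.nn.relu6. The hard_sigmoid helper name is my own, not a TF built-in.

```python
import tensorflow as tf

# Workaround from the question: max(0, min(1, x)) == relu6(6 * x) / 6.
# The helper name hard_sigmoid is illustrative, not a built-in TF function.
def hard_sigmoid(x):
    return tf.nn.relu6(6.0 * x) / 6.0

x = tf.constant([-1.0, 0.25, 0.5, 2.0])
print(hard_sigmoid(x).numpy())  # [0.   0.25 0.5  1.  ]
```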

asked Nov 10 '17 by FaultyBagnose

People also ask

What is ReLU6?

ReLU6 is a modification of the rectified linear unit where the activation is limited to a maximum size of 6. This gives increased robustness when used with low-precision computation. Source: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.

What is ReLU6 used for?

ReLU6 is an activation function commonly used in deep convolutional neural networks. It comes up fairly often in mobile machine learning because it is used in Google's optimized MobileNet architecture and can cause errors when trying to convert such models to run on device.

What kind of an activation function is ReLU6?

The ReLU6 activation function restricts any input value of 6 or greater to the value 6 (hence the name). It is made up of three linear components, but it is a non-linear function overall.

What is SoftPlus?

SoftPlus is a smooth approximation to the ReLU function and can be used to constrain the output of a machine to always be positive.
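As a quick illustration (assuming TensorFlow 2 with eager execution), softplus(x) = log(1 + exp(x)) stays strictly positive and approaches ReLU for large inputs:

```python
import tensorflow as tf

# softplus(x) = log(1 + exp(x)): a smooth, strictly positive approximation of relu(x).
x = tf.constant([-2.0, 0.0, 2.0])
print(tf.math.softplus(x).numpy())  # ~[0.127 0.693 2.127]
print(tf.nn.relu(x).numpy())        # [0. 0. 2.]
```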


2 Answers

From this reddit thread:

This is useful in making the networks ready for fixed-point inference. If you unbound the upper limit, you lose too many bits to the Q part of a Q.f number. Keeping the ReLUs bounded by 6 will let them take a max of 3 bits (upto 8) leaving 4/5 bits for .f

It seems, then, that 6 is just an arbitrary value chosen according to the number of bits you want to be able to compress your network's trained parameters into. As for why only the version with the value 6 is implemented, I assume it's because that's the value that fits best in 8 bits, which is probably the most common use case.
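To make the fixed-point argument concrete, here is a small sketch of my own (not from the answer, and not TensorFlow's actual quantization pipeline): with activations bounded to [0, 6], 3 integer bits cover the whole range, leaving the rest of an 8-bit word for the fractional part.

```python
import tensorflow as tf

# Illustrative only: a toy Q3.5 fixed-point scheme for activations bounded to [0, 6].
x = tf.constant([-2.0, 0.7, 3.14159, 10.0])
y = tf.nn.relu6(x)                          # bounded to [0, 6]

scale = 2 ** 5                              # 3 integer bits + 5 fractional bits = 8 bits
q = tf.cast(tf.round(y * scale), tf.uint8)  # quantize: 6.0 * 32 = 192 fits in uint8
deq = tf.cast(q, tf.float32) / scale        # dequantize
print(y.numpy())    # [0.      0.7     3.14159 6.     ]
print(deq.numpy())  # same values, within 1/32
```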

answered Oct 01 '22 by GPhilo


TensorFlow's documentation (https://www.tensorflow.org/api_docs/python/tf/nn/relu6) points to the following paper:

... First, we cap the units at 6, so our ReLU activation function is y = min(max(x, 0), 6). In our tests, this encourages the model to learn sparse features earlier. In the formulation of [8], this is equivalent to imagining that each ReLU unit consists of only 6 replicated bias-shifted Bernoulli units, rather than an infinite amount. We will refer to ReLU units capped at n as ReLU-n units.

http://www.cs.utoronto.ca/~kriz/conv-cifar10-aug2010.pdf

Since it originates from this paper, I suspect that they tested different values of n and got the best results on their test set with n = 6.
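For completeness, here is a sketch of the paper's general "ReLU-n" (cap at n), of which relu6 is the n = 6 case. The relu_n helper below is my own; only tf.nn.relu6 ships with TensorFlow.

```python
import tensorflow as tf

# "ReLU-n" from the paper: y = min(max(x, 0), n). relu6 is the special case n = 6.
# The function name relu_n is illustrative, not a TensorFlow built-in.
def relu_n(x, n=6.0):
    return tf.minimum(tf.maximum(x, 0.0), n)

x = tf.constant([-3.0, 2.0, 7.5])
print(relu_n(x).numpy())       # [0. 2. 6.]
print(tf.nn.relu6(x).numpy())  # identical for n = 6
```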

answered Oct 01 '22 by Rick