I am working on deep nets using Keras. There is an activation called "hard sigmoid". What is its mathematical definition?
I know what the sigmoid is. Someone asked a similar question on Quora: https://www.quora.com/What-is-hard-sigmoid-in-artificial-neural-networks-Why-is-it-faster-than-standard-sigmoid-Are-there-any-disadvantages-over-the-standard-sigmoid
But I could not find the precise mathematical definition anywhere.
The Hard Sigmoid is an activation function used in neural networks, of the form f(x) = max(0, min(1, (x + 1)/2)). Source: BinaryConnect: Training Deep Neural Networks with binary weights during propagations.
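A minimal NumPy sketch of this piecewise-linear form (the function name hard_sigmoid_bc is just illustrative, not from the paper):
import numpy as np

def hard_sigmoid_bc(x):
    # BinaryConnect form: clip((x + 1) / 2, 0, 1)
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

print(hard_sigmoid_bc(np.array([-2.0, -1.0, 0.0, 1.0, 2.0])))  # [0.  0.  0.5 1.  1. ]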
Hyperbolic tangent function: similar in shape, but this time the output range is (-1, +1). The advantage over the sigmoid function is that its derivative is steeper around zero, so it yields larger gradients; the wider output range can make learning faster.
Hard Swish is a type of activation function based on Swish, but replaces the computationally expensive sigmoid with a piecewise linear analogue: h-swish(x) = x · ReLU6(x + 3) / 6. Source: Searching for MobileNetV3.
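A short NumPy sketch of that formula (illustrative only; relu6 and hard_swish are names I chose here, not library functions):
import numpy as np

def relu6(x):
    # ReLU capped at 6: min(max(x, 0), 6)
    return np.minimum(np.maximum(x, 0.0), 6.0)

def hard_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6
    return x * relu6(x + 3.0) / 6.0

print(hard_swish(np.array([-4.0, -1.0, 0.0, 1.0, 4.0])))  # approx. [-0., -0.333, 0., 0.667, 4.]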
More generally, a sigmoid is an S-shaped curve that describes many processes in psychology, including learning and responding to test items. The curve starts low, has a period of acceleration, and then approaches an asymptote. The curve is often characterized by the logistic function.
Since Keras supports both TensorFlow and Theano, the exact implementation might differ for each backend; I'll cover Theano only. For the Theano backend, Keras uses T.nnet.hard_sigmoid, which is in turn a linear approximation of the standard sigmoid:
slope = tensor.constant(0.2, dtype=out_dtype)
shift = tensor.constant(0.5, dtype=out_dtype)
x = (x * slope) + shift
x = tensor.clip(x, 0, 1)
i.e. it is: max(0, min(1, x*0.2 + 0.5))
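In plain NumPy, the Keras/Theano variant is equivalent to the following sketch (the name keras_hard_sigmoid is just illustrative):
import numpy as np

def keras_hard_sigmoid(x):
    # Keras/Theano variant: clip(0.2 * x + 0.5, 0, 1)
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

print(keras_hard_sigmoid(np.array([-3.0, -2.5, 0.0, 2.5, 3.0])))  # [0.  0.  0.5 1.  1. ]
It matches the logistic sigmoid at 0 and saturates for |x| >= 2.5.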
The hard sigmoid is normally a piecewise linear approximation of the logistic sigmoid function. Depending on what properties of the original sigmoid you want to keep, you can use a different approximation.
I personally like to keep the function correct at zero, i.e. σ(0) = 0.5 (shift) and σ'(0) = 0.25 (slope). This could be coded as follows:
import numpy as np

def hard_sigmoid(x):
    return np.maximum(0, np.minimum(1, (x + 2) / 4))  # clip(x/4 + 0.5, 0, 1)
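As a quick check of those two properties (an illustrative snippet, not part of the answer):
import numpy as np

def hard_sigmoid(x):
    return np.maximum(0, np.minimum(1, (x + 2) / 4))

print(hard_sigmoid(0.0))                               # 0.5, same as the logistic sigmoid at 0
print((hard_sigmoid(0.1) - hard_sigmoid(-0.1)) / 0.2)  # approx. 0.25, the slope near 0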
For reference, the hard sigmoid function may be defined differently in different places. In Courbariaux et al. 2016 [1] it's defined as:
σ is the “hard sigmoid” function: σ(x) = clip((x + 1)/2, 0, 1) = max(0, min(1, (x + 1)/2))
The intent is to provide a probability value (hence constraining it to be between 0 and 1) for use in stochastic binarization of neural network parameters (e.g. weights, activations, gradients). You use the probability p = σ(x) returned from the hard sigmoid function to set the parameter x to +1 with probability p, or to -1 with probability 1 - p.
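To make that use concrete, here is a minimal NumPy sketch of stochastic binarization under the Courbariaux et al. definition (illustrative only; the actual BinaryConnect/BNN training code does considerably more, and stochastic_binarize is a name I chose here):
import numpy as np

def hard_sigmoid(x):
    # Courbariaux et al. 2016: clip((x + 1) / 2, 0, 1)
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def stochastic_binarize(w, rng):
    # Binarize to +1 with probability p = sigma(w), otherwise to -1
    p = hard_sigmoid(w)
    return np.where(rng.random(w.shape) < p, 1.0, -1.0)

rng = np.random.default_rng()
w = np.array([-1.5, -0.2, 0.0, 0.2, 1.5])
print(stochastic_binarize(w, rng))
# The first and last entries are saturated (p = 0 and p = 1), so they always map to -1 and +1;
# the middle entries are random.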
[1] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, Yoshua Bengio, "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1", arXiv:1602.02830, 2016. https://arxiv.org/abs/1602.02830