
Considerations for using ReLU as activation function

I'm implementing a neural network, and I wanted to use ReLU as the activation function of the neurons. Furthermore, I'm training the network with SGD and back-propagation. I'm testing the neural network on the paradigmatic XOR problem, and so far it classifies new samples correctly if I use the logistic function or the hyperbolic tangent as the activation function.

I've been reading about the benefits of using the leaky ReLU as an activation function, and implemented it in Python like this:

def relu(data, epsilon=0.1):
    # Leaky ReLU: returns data where data > 0, and epsilon * data otherwise
    return np.maximum(epsilon * data, data)

where np is the name for NumPy. The associated derivative is implemented like this:

def relu_prime(data, epsilon=0.1):
    if 1. * np.all(epsilon < data):
        return 1
    return epsilon

Using this function as activation I get incorrect results. For example:

  • Input = [0, 0] --> Output = [0.43951457]

  • Input = [0, 1] --> Output = [0.46252925]

  • Input = [1, 0] --> Output = [0.34939594]

  • Input = [1, 1] --> Output = [0.37241062]

As can be seen, the outputs differ greatly from the expected XOR outputs. So the question is: are there any special considerations when using ReLU as the activation function?

Please don't hesitate to ask me for more context or code. Thanks in advance.

EDIT: There is a bug in the derivative, as it only returns a single float value rather than a NumPy array. The correct code should be:

def relu_prime(data, epsilon=0.1):
    gradients = 1. * (data > epsilon)
    gradients[gradients == 0] = epsilon
    return gradients

asked Jan 08 '17 by tulians

2 Answers

Your relu_prime function should be:

def relu_prime(data, epsilon=0.1):
    gradients = 1. * (data > 0)
    gradients[gradients == 0] = epsilon
    return gradients

Note the comparison of each value in the data matrix to 0, instead of epsilon. This follows from the standard definition of leaky ReLUs, whose gradient is 1 when x > 0 and epsilon otherwise.
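
For illustration, here is a quick check of that gradient on a sample array (the sample values below are made up for this sketch; the function is repeated so the snippet runs on its own):

import numpy as np

def relu_prime(data, epsilon=0.1):
    # 1.0 where data > 0, epsilon everywhere else
    gradients = 1. * (data > 0)
    gradients[gradients == 0] = epsilon
    return gradients

# Hypothetical pre-activation values, chosen only to exercise both branches
z = np.array([-2.0, -0.05, 0.0, 0.3, 1.7])
print(relu_prime(z))  # [0.1 0.1 0.1 1.  1. ]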

I can't comment on whether leaky ReLUs are the best choice for the XOR problem, but this should resolve your gradient issue.

answered Oct 22 '22 by Nick Becker


Short answer

Don't use ReLU with binary digits; it is designed to operate on much larger values. Also avoid using it when there are no negative values, because that effectively means you are using a linear activation function, which is not the best choice. It works best with convolutional neural networks.
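
As a small illustration of that last point (a sketch with made-up inputs, not part of the original answer): when the inputs are never negative, standard ReLU reduces to the identity function, i.e. a purely linear activation:

import numpy as np

def relu(x):
    # Standard (non-leaky) ReLU
    return np.maximum(0, x)

# Hypothetical non-negative inputs, e.g. binary features like XOR's {0, 1}
x = np.array([0.0, 0.25, 1.0])
print(relu(x))                     # [0.   0.25 1.  ]  (unchanged)
print(np.array_equal(relu(x), x))  # True: ReLU acts as the identity here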

Long answer

I can't say whether there is anything wrong with the Python code, because I code in Java. But logic-wise, I think using ReLU in this case is a bad decision. Since we are predicting XOR, the values your NN has to produce are limited to the range [0, 1], which is also the range of the sigmoid activation function. With ReLU you operate with values in [0, infinity), which means there is an awful lot of values that you are never going to use, since the task is XOR. ReLU will still take these values into consideration, though, and the error you get will increase. That is why you only get correct answers about 50% of the time; in fact, this figure can be as low as 0% and as high as 99%. Moral of the story: when deciding which activation function to use, try to match the range of values in your NN with the range of the activation function's values.
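
To make the range argument concrete, here is a small sketch (my own illustration with made-up pre-activation values) comparing the output ranges of sigmoid and ReLU:

import numpy as np

def sigmoid(z):
    # Squashes any real input into (0, 1), matching XOR's target range
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Unbounded above: outputs lie in [0, infinity)
    return np.maximum(0, z)

# Hypothetical pre-activation values an output neuron might see
z = np.array([-3.0, -0.5, 0.0, 2.0, 10.0])
print(sigmoid(z))  # every value lies strictly between 0 and 1
print(relu(z))     # values can exceed 1 (here up to 10), unlike the XOR targets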

answered Oct 22 '22 by Arnis Shaykh