Recently I started toying with neural networks. I was trying to implement an AND gate with TensorFlow. I am having trouble understanding when to use different cost and activation functions. This is a basic neural network with only input and output layers, no hidden layers.
First I tried to implement it in this way. As you can see this is a poor implementation, but I think it gets the job done, at least in some way. So I tried only the real outputs, no one-hot true outputs. For the activation function I used a sigmoid function, and for the cost function I used the squared error cost function (I think it's called that, correct me if I'm wrong).
I've tried using ReLU and softmax as activation functions (with the same cost function), and they don't work. I figured out why they don't work. I also tried the sigmoid function with the cross entropy cost function, but it also doesn't work.
import tensorflow as tf
import numpy

train_X = numpy.asarray([[0,0],[0,1],[1,0],[1,1]])
train_Y = numpy.asarray([[0],[0],[0],[1]])

x = tf.placeholder("float", [None, 2])
y = tf.placeholder("float", [None, 1])

W = tf.Variable(tf.zeros([2, 1]))
b = tf.Variable(tf.zeros([1, 1]))

activation = tf.nn.sigmoid(tf.matmul(x, W) + b)
cost = tf.reduce_sum(tf.square(activation - y)) / 4
optimizer = tf.train.GradientDescentOptimizer(.1).minimize(cost)

init = tf.initialize_all_variables()

with tf.Session() as sess:
    sess.run(init)
    for i in range(5000):
        train_data = sess.run(optimizer, feed_dict={x: train_X, y: train_Y})

    result = sess.run(activation, feed_dict={x: train_X})
    print(result)
after 5000 iterations:
[[ 0.0031316 ]
 [ 0.12012422]
 [ 0.12012422]
 [ 0.85576665]]
Question 1 - Is there any other activation function and cost function that can work (learn) for the above network, without changing the parameters (meaning without changing W, x, b)?
Question 2 - I read from a StackOverflow post here:
[Activation Function] selection depends on the problem.
So are there no cost functions that can be used anywhere? I mean, is there no standard cost function that can be used on any neural network? Please correct me on this.
I also implemented the AND gate with a different approach, with the output as one-hot true. As you can see, the train_Y value [1,0] means that the 0th index is 1, so the answer is 0. I hope you get it.
Here I have used the softmax activation function, with cross entropy as the cost function. The sigmoid function as the activation fails miserably here.
import tensorflow as tf
import numpy

train_X = numpy.asarray([[0,0],[0,1],[1,0],[1,1]])
train_Y = numpy.asarray([[1,0],[1,0],[1,0],[0,1]])

x = tf.placeholder("float", [None, 2])
y = tf.placeholder("float", [None, 2])

W = tf.Variable(tf.zeros([2, 2]))
b = tf.Variable(tf.zeros([2]))

activation = tf.nn.softmax(tf.matmul(x, W) + b)
cost = -tf.reduce_sum(y * tf.log(activation))
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(cost)

init = tf.initialize_all_variables()

with tf.Session() as sess:
    sess.run(init)
    for i in range(5000):
        train_data = sess.run(optimizer, feed_dict={x: train_X, y: train_Y})

    result = sess.run(activation, feed_dict={x: train_X})
    print(result)
after 5000 iterations:
[[  1.00000000e+00   1.41971401e-09]
 [  9.98996437e-01   1.00352429e-03]
 [  9.98996437e-01   1.00352429e-03]
 [  1.40495342e-03   9.98595059e-01]]
Question 3 - So in this case, what cost function and activation function can I use? How do I understand what type of cost and activation functions I should use? Is there a standard way or rule, or is it just experience? Do I have to try every cost and activation function in a brute-force manner? I found an answer here, but I am hoping for a more elaborate explanation.
Question 4 - I have noticed that it takes many iterations to converge to a near-accurate prediction. I think the convergence rate depends on the learning rate (too large a value will overshoot the solution) and on the cost function (correct me if I'm wrong). So, is there an optimal way (meaning the fastest) or an optimal cost function for converging to a correct solution?
The cost function is the sum ∑_i (y_i − f_θ(x_i))² (this is only an example; it could be the absolute value instead of the square). Training the hypothetical model stated above would be the process of finding the θ that minimizes this sum. An activation function transforms the shape/representation of the data in the model.
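To make the cost part of that concrete, here is a minimal sketch in plain numpy. The one-parameter model f_θ(x) = θ·x and the data are made up purely for illustration, not taken from the question:

import numpy as np

# Hypothetical data and a hypothetical one-parameter model f_theta(x) = theta * x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 2.0, 4.0, 6.0])

def cost(theta):
    predictions = theta * x                  # f_theta(x_i)
    return np.sum((y - predictions) ** 2)    # sum of squared errors

print(cost(1.0))  # 14.0 -- a poor fit
print(cost(2.0))  # 0.0  -- theta = 2 minimizes this cost

Training is just a search over θ for the value that makes this number as small as possible.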
📢 Note: All hidden layers usually use the same activation function. However, the output layer will typically use a different activation function from the hidden layers. The choice depends on the goal or type of prediction made by the model.
Activation functions are a critical part of the design of a neural network. The choice of activation function in the hidden layer will control how well the network model learns the training dataset. The choice of activation function in the output layer will define the type of predictions the model can make.
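As a rough illustration of that split (hidden activation vs. output activation), here is a sketch using the same TF 1.x API as the snippets above; the layer sizes and the random initialization are arbitrary choices, not taken from the question:

import tensorflow as tf

x = tf.placeholder("float", [None, 2])

# Hidden layer: ReLU is a common default for hidden activations
W1 = tf.Variable(tf.random_normal([2, 4], stddev=0.1))
b1 = tf.Variable(tf.zeros([4]))
hidden = tf.nn.relu(tf.matmul(x, W1) + b1)

# Output layer: the activation depends on what the model predicts
W2 = tf.Variable(tf.random_normal([4, 2], stddev=0.1))
b2 = tf.Variable(tf.zeros([2]))
logits = tf.matmul(hidden, W2) + b2

class_probs = tf.nn.softmax(logits)    # multi-class classification
binary_probs = tf.nn.sigmoid(logits)   # independent yes/no outputs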
I will answer your questions a little bit out of order, starting with more general answers, and finishing with those specific to your particular experiment.
Activation functions
Different activation functions do, in fact, have different properties. Let's first consider an activation function between two layers of a neural network. The only purpose of an activation function there is to serve as a nonlinearity. If you do not put an activation function between two layers, then the two layers together will serve no better than one, because their combined effect is still just a linear transformation. For a long while people used the sigmoid function and tanh, choosing pretty much arbitrarily, with sigmoid being more popular, until recently, when ReLU became the dominant nonlinearity. The reason people use ReLU between layers is that it is non-saturating (and also faster to compute). Think about the graph of the sigmoid function: if the absolute value of x is large, then the derivative of the sigmoid is small, which means that as we propagate the error backwards, the gradient will vanish very quickly as we go back through the layers. With ReLU the derivative is 1 for all positive inputs, so the gradient for those neurons that fired will not be shrunk by the activation unit at all and will not slow down gradient descent.
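A quick numeric sketch of that argument (plain numpy, input values chosen only for illustration): the sigmoid's derivative shrinks rapidly as |x| grows, while the ReLU's derivative stays at 1 for positive inputs.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # derivative of the sigmoid

def relu_grad(z):
    return (z > 0).astype(float)  # 1 for positive inputs, 0 otherwise

z = np.array([0.0, 2.0, 5.0, 10.0])
print(sigmoid_grad(z))  # ~[0.25, 0.105, 0.0066, 0.000045] -- vanishes quickly
print(relu_grad(z))     # [0. 1. 1. 1.] -- stays at 1 for positive inputs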
For the last layer of the network the activation unit also depends on the task. For regression you will want to use the sigmoid or tanh activation, because you want the result to be between 0 and 1. For classification you will want only one of your outputs to be one and all others zeros, but there's no differentiable way to achieve precisely that, so you will want to use a softmax to approximate it.
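For instance, here is a sketch (plain numpy, with arbitrary example logits) of how softmax approximates that "one output is 1, the rest are 0" behaviour as the gap between the logits grows:

import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))  # shift for numerical stability
    return e / np.sum(e)

# The larger the gap between the logits, the closer the output is to one-hot
print(softmax(np.array([1.0, 2.0, 3.0])))   # ~[0.09, 0.24, 0.67]
print(softmax(np.array([1.0, 2.0, 10.0])))  # ~[0.0001, 0.0003, 0.9996]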
Your example
Now let's look at your example. Your first example tries to compute the output of AND in the following form:

sigmoid(W1 * x1 + W2 * x2 + B)

Note that W1 and W2 will always converge to the same value, because the output for (x1, x2) should be equal to the output for (x2, x1). Therefore, the model that you are fitting is:

sigmoid(W * (x1 + x2) + B)

x1 + x2 can only take one of three values (0, 1 or 2), and you want to return 0 for the case when x1 + x2 < 2 and 1 for the case when x1 + x2 = 2. Since the sigmoid function is rather smooth, it takes very large values of W and B to make the output close to the desired one, but because of the small learning rate they can't get to those large values fast. Increasing the learning rate in your first example will increase the speed of convergence.
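For example, here is the first snippet from the question with only the learning rate changed; the value 1.0 is an illustrative guess, not a tuned setting, and everything else is identical:

import tensorflow as tf
import numpy

train_X = numpy.asarray([[0,0],[0,1],[1,0],[1,1]])
train_Y = numpy.asarray([[0],[0],[0],[1]])

x = tf.placeholder("float", [None, 2])
y = tf.placeholder("float", [None, 1])
W = tf.Variable(tf.zeros([2, 1]))
b = tf.Variable(tf.zeros([1, 1]))

activation = tf.nn.sigmoid(tf.matmul(x, W) + b)
cost = tf.reduce_sum(tf.square(activation - y)) / 4

# Larger step size than the original 0.1, so W and b can reach large values sooner
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(cost)

init = tf.initialize_all_variables()
with tf.Session() as sess:
    sess.run(init)
    for i in range(5000):
        sess.run(optimizer, feed_dict={x: train_X, y: train_Y})
    print(sess.run(activation, feed_dict={x: train_X}))

With the bigger step size the sigmoid outputs should end up much closer to 0 and 1 within the same 5000 iterations.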
Your second example converges better because the softmax function is good at making exactly one output equal to 1 and all the others equal to 0. Since this is precisely your case, it does converge quickly. Note that sigmoid would also eventually converge to good values, but it would take significantly more iterations (or a higher learning rate).
What to use
Now to the last question: how does one choose which activation and cost functions to use? This advice will work for the majority of cases:

If you do classification, use softmax for the last layer's nonlinearity and cross entropy as the cost function.
If you do regression, use sigmoid or tanh for the last layer's nonlinearity and squared error as the cost function.
Use ReLU as the nonlinearity between layers.
Use better optimizers (AdamOptimizer, AdagradOptimizer) instead of GradientDescentOptimizer, or use momentum, for faster convergence (see the sketch below).
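As a concrete example of that last point, these are drop-in replacements for the optimizer line in the snippets above (the 0.01 and 0.9 values are only illustrative, not tuned):

# Adam instead of plain gradient descent
optimizer = tf.train.AdamOptimizer(0.01).minimize(cost)

# or gradient descent with momentum
optimizer = tf.train.MomentumOptimizer(0.1, momentum=0.9).minimize(cost)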