
Why use softmax only in the output layer and not in hidden layers?

Most examples of neural networks for classification tasks I've seen use a softmax layer as the output activation function. Normally, the hidden units use a sigmoid, tanh, or ReLU function as the activation function. As far as I know, using the softmax function in the hidden layers would work out mathematically too.

  • What are the theoretical justifications for not using the softmax function as hidden layer activation functions?
  • Are there any publications about this, something to quote?
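For context, this is roughly the layout I mean; a minimal sketch assuming Keras purely for illustration, with arbitrary layer sizes:

```python
# Typical layout: sigmoid/tanh/ReLU in the hidden layers, softmax only at the output.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),  # hidden layer
    tf.keras.layers.Dense(64, activation="tanh"),                       # hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),                    # output layer
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```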
asked Jun 02 '16 by beyeran


1 Answer

I haven't found any publications about why using softmax as an activation function in a hidden layer is not the best idea (apart from the Quora question, which you have probably already read), but I will try to explain why it is not the best choice in this case:

1. Variable independence: a lot of regularization and effort goes into keeping your variables independent, uncorrelated, and quite sparse. If you use a softmax layer as a hidden layer, then you keep all your nodes (hidden variables) linearly dependent, which may result in many problems and poor generalization.
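A quick numerical illustration of that dependency (a NumPy sketch; the softmax helper is defined here for the example, not taken from any library): because the outputs always sum to 1, any one hidden unit is fully determined by the others.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # subtract the max for numerical stability
    return e / e.sum()

h = softmax(np.array([0.3, -1.2, 2.0, 0.7]))   # pretend these are hidden pre-activations
print(h.sum())                          # always 1.0 -> the units are linearly dependent
print(h[-1], 1.0 - h[:-1].sum())        # the last unit is recoverable from the rest
```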

2. Training issues: imagine that, in order to make your network work better, you have to make part of the activations from your hidden layer a little bit lower. Then, automatically, you force the rest of them to have a higher mean activation, which may in fact increase the error and harm your training phase.
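To see this coupling concretely (the same kind of NumPy sketch, with made-up numbers): pushing one pre-activation down necessarily pushes the softmax value of every other unit up.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 0.5, -0.2])          # hypothetical hidden pre-activations
z_pushed = z.copy()
z_pushed[0] -= 1.0                      # lower only the first unit's input

print(softmax(z))                       # roughly [0.52, 0.32, 0.16]
print(softmax(z_pushed))                # the other two outputs both increase
```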

3. Mathematical issues: by creating constraints on the activations of your model you decrease its expressive power without any logical explanation. The drive to constrain all the activations in this way is not worth it, in my opinion.
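One concrete way to see the lost expressive power (again just a sketch): softmax ignores any constant added to all of its inputs, so a whole direction of the pre-activation space is thrown away.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.1, -0.4, 1.3])
# Shifting every input by the same constant leaves the output unchanged.
print(np.allclose(softmax(z), softmax(z + 5.0)))   # True
```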

4. Batch Normalization does it better: one might argue that a constant mean output from a layer may be useful for training. But on the other hand, a technique called Batch Normalization has already been shown to work better, whereas it has been reported that setting softmax as the activation function in a hidden layer may decrease both the accuracy and the speed of learning.
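For comparison, a hedged Keras-style sketch of that alternative (layer sizes are placeholders): normalize the hidden activations with Batch Normalization instead of constraining them with a softmax, and keep softmax only at the output.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, use_bias=False, input_shape=(784,)),  # hidden layer
    tf.keras.layers.BatchNormalization(),                            # normalize instead of softmax
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),                 # softmax only at the output
])
```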

answered Sep 28 '22 by Marcin Możejko