Why use softmax only in the output layer and not in hidden layers?

Most examples of neural networks for classification tasks I've seen use a softmax layer as the output activation function. Normally, the hidden units use a sigmoid, tanh, or ReLU activation function. As far as I know, using the softmax function in the hidden layers would work out mathematically too.

What are the theoretical justifications for not using the softmax function as a hidden layer activation function?
Are there any publications about this, something to quote?

968

asked Jun 02 '16 10:06

beyeran

1 Answer

I haven't found any publications explaining why using softmax as a hidden layer activation is not the best idea (except a Quora question, which you have probably already read), but I will try to explain why it is not a good choice in this case:

1. Variable independence: a lot of regularization and effort goes into keeping your variables independent, uncorrelated, and fairly sparse. If you use a softmax layer as a hidden layer, the activations always sum to 1, so all of your nodes (hidden variables) are linearly dependent, which may cause many problems and poor generalization.

2. Training issues: imagine that, to make your network work better, you need to make some of the activations in your hidden layer a little lower. You then automatically push the rest of them to a higher mean activation, which might in fact increase the error and harm training.

3. Mathematical issues: by placing constraints on the activations of your model, you decrease its expressive power without any logical justification. Forcing the activations to always sum to the same value is not worth it, in my opinion.

4. Batch normalization does it better: one could argue that a constant mean output from a network may be useful for training. On the other hand, a technique called Batch Normalization has already been shown to work better, whereas it has been reported that using softmax as the activation function in a hidden layer may decrease both accuracy and the speed of learning.
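
Points 1 and 2 can be checked numerically. The following is only a minimal sketch in plain Python. The four pre-activation values are made up for illustration and do not come from any real network. It shows that softmax activations always sum to 1, and that lowering one unit's input raises every other unit's activation, whereas an element-wise activation such as ReLU leaves the other units untouched.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def relu(z):
    """Element-wise ReLU, used here only for contrast."""
    return [max(0.0, v) for v in z]


# Hypothetical pre-activations of a 4-unit hidden layer (illustrative values).
z = [2.0, 1.0, 0.5, -1.0]
a = softmax(z)

# Point 1: the activations sum to 1, so each one is determined by the others.
total = sum(a)

# Point 2: lower only the first pre-activation and recompute.
z_lowered = [z[0] - 1.0] + z[1:]
b = softmax(z_lowered)

# The first activation drops, and all the others rise even though
# their own inputs did not change.
first_dropped = b[0] < a[0]
others_rose = all(b[i] > a[i] for i in range(1, len(z)))

# With ReLU, the other units are unaffected by the change to the first one.
relu_unchanged = relu(z_lowered)[1:] == relu(z)[1:]

print(total, first_dropped, others_rose, relu_unchanged)
```

The coupling follows directly from the shared normalizing sum in the softmax denominator: shrinking one exponential shrinks the denominator, which increases every other ratio.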

133

answered Sep 28 '22 11:09

Marcin Możejko

Tags:

machine-learning

neural-network

classification

softmax

activation-function
