Most neural networks achieve high accuracy with only one hidden layer, so what is the purpose of multiple hidden layers?

How do multiple hidden layers in a neural network improve its ability to learn?

1 Answer

To answer your question, you first need to understand why the term 'deep learning' was coined almost a decade ago. Deep learning is nothing but a neural network with several hidden layers. The term 'deep' roughly refers to the way our brain passes sensory inputs (especially from the eyes through the visual cortex) through different layers of neurons to do inference. However, until about a decade ago, researchers were not able to train neural networks with more than one or two hidden layers because of issues such as vanishing or exploding gradients, getting stuck in local minima, and optimization techniques that were less effective than those used today. In 2006 and 2007, several researchers [1, 2] showed new techniques that enabled better training of neural networks with more hidden layers, and the era of deep learning began.

In deep neural networks the goal is to mimic what the brain does (hopefully). Before going further, I should point out that, from an abstract point of view, the problem in any learning algorithm is to approximate a function given some inputs X and outputs Y. This is also the case for neural networks, and it has been proven theoretically that a neural network with only one hidden layer, using a bounded, continuous activation function in its units, can approximate any continuous function on a compact input domain to arbitrary accuracy, provided it has enough hidden units. This result is known as the universal approximation theorem. However, this raises the question of why, in practice, current neural networks with one hidden layer cannot approximate any function with very high accuracy (say >99%). This could be due to many reasons:

- The current learning algorithms are not as effective as they should be.

- For a specific problem, how should one choose the exact number of hidden units so that the desired function is learned and the underlying manifold is approximated well?

- The number of training examples needed could be exponential in the number of hidden units. So how many training examples should one train a model with? This could turn into a chicken-and-egg problem!

- What is the right bounded, continuous activation function, and does the universal approximation theorem generalize to activation functions other than the sigmoid?

- There are other questions that need to be answered as well, but I think the most important ones are those mentioned above.
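To make the question about the number of hidden units concrete, here is a minimal sketch, not a definitive implementation. It fits a one-hidden-layer tanh network to a toy target and reports how the fit error changes with the number of hidden units. Everything in it is an assumption made for illustration: the target `sin(2x)`, the function names, and in particular the training shortcut. The hidden weights are drawn at random and kept fixed, and only the output weights are solved by least squares, which is not how such networks are normally trained.

```python
import numpy as np

def fit_one_hidden_layer(x, y, n_hidden, seed=0):
    """Fit y ~ tanh(x * w + b) @ v (hypothetical helper).

    Assumption: w and b are random and fixed; only the output
    weights v are solved for, by linear least squares.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=3.0, size=n_hidden)    # input-to-hidden weights
    b = rng.uniform(-3.0, 3.0, size=n_hidden)   # hidden biases
    hidden = np.tanh(np.outer(x, w) + b)        # shape (n_samples, n_hidden)
    v, *_ = np.linalg.lstsq(hidden, y, rcond=None)
    return w, b, v

def predict(x, w, b, v):
    """Forward pass of the one-hidden-layer network."""
    return np.tanh(np.outer(x, w) + b) @ v

# Toy regression target on a bounded interval (an assumption, not from the answer).
x = np.linspace(-np.pi, np.pi, 200)
y = np.sin(2 * x)

errors = {}
for n_hidden in (1, 5, 50):
    w, b, v = fit_one_hidden_layer(x, y, n_hidden)
    # Maximum absolute error on the training points themselves.
    errors[n_hidden] = float(np.max(np.abs(predict(x, w, b, v) - y)))
    print(f"{n_hidden:3d} hidden units: max |error| = {errors[n_hidden]:.4f}")
```

The error here is measured on the same points used for fitting, so it only illustrates approximation capacity. It says nothing about generalization or about how many training examples a given width would need.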

Before anyone could come up with provable answers to the above questions (either theoretically or empirically), researchers started using more than one hidden layer with a limited number of hidden units. Empirically this has shown a great advantage. Although adding more hidden layers increases the computational cost, it has been shown empirically that networks with more hidden layers learn hierarchical representations of the input data and can generalize better to unseen data. The picture below shows how a deep neural network can learn hierarchies of features and combine them successively as we go from the first hidden layer to the last:

[Figure: hierarchy of features learned by successive hidden layers, https://i.stack.imgur.com/Tmu9G.jpg]

As you can see, the first hidden layer (shown at the bottom) learns some edges; combining those seemingly useless representations yields parts of objects, and combining those parts in turn yields things like faces, cars, elephants, and chairs. Note that these results would not have been achievable without new optimization techniques and new activation functions.
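The idea that each layer combines the outputs of the previous one can be illustrated with a small hand-built example. It is a sketch under stated assumptions, not the trained feature hierarchy from the picture. The weights are set by hand rather than learned, the activation is ReLU rather than a sigmoid, and all function names are hypothetical. Each layer has only two ReLU units and computes a "tent" function on [0, 1]; stacking the same layer repeatedly produces a sawtooth whose number of linear pieces doubles with every layer. For comparison, a single hidden layer of n ReLU units on a one-dimensional input can produce at most n + 1 linear pieces.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent_layer(x):
    """One hidden layer with two hand-set ReLU units.

    On [0, 1] it computes the tent map: 2x for x <= 0.5, 2(1 - x) otherwise.
    """
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_sawtooth(x, depth):
    """Stack `depth` copies of the same two-unit layer."""
    out = x
    for _ in range(depth):
        out = tent_layer(out)
    return out

def count_linear_pieces(y):
    """Count pieces of a sampled sawtooth via slope sign changes.

    This shortcut is valid here only because adjacent pieces of the
    sawtooth always alternate between rising and falling.
    """
    slopes = np.sign(np.diff(y))
    return int(np.sum(slopes[1:] != slopes[:-1])) + 1

# Dyadic grid so that every breakpoint falls exactly on a sample point.
xs = np.linspace(0.0, 1.0, 2**12 + 1)
for depth in (1, 2, 3, 5):
    pieces = count_linear_pieces(deep_sawtooth(xs, depth))
    print(f"depth {depth}: {2 * depth} ReLU units, {pieces} linear pieces")
```

With these hand-set weights, the depth-5 stack uses 10 ReLU units in total, whereas a one-hidden-layer ReLU network would need at least 31 units to produce the same number of pieces. This shows only what depth can represent compactly; whether training actually finds such representations is the empirical question the answer discusses.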

answered Oct 10 '22 09:10

Amir

Tags:

machine-learning

neural-network

deep-learning

conv-neural-network

RickyTamma
