
Understanding the influence of random start weights on neural network performance

Using R and the neuralnet package, I am trying to model data that looks like this:

[Image: training data]

These are temperature readings at 10-minute intervals over several days (above is a 2-day cutout). Using the code below, I fit a neural network to the data. There are probably simpler ways to model this exact data, but in the future the data might look quite different. Using a single hidden layer with 2 neurons gives me satisfactory results:

[Image: good neural network performance]

This also works most of the time with more layers and neurons. However, with one hidden layer with one neuron and occasionally with two layers (in my case 3 and 2 neurons respectively), I get rather poor results, always in the same shape:

[Image: poor neural network performance]

The only thing random is the initialization of the start weights, so I assume it's related to that. However, I must admit that I have not fully grasped the theory of neural networks yet. What I would like to know is whether the poor results are due to a local minimum ('neuralnet' uses resilient backpropagation with weight backtracking by default) and I'm simply out of luck, or whether I can avoid such a scenario. I am under the impression that there is an optimal number of hidden nodes for fitting, e.g., polynomials of degree 2, 5, 10. If not, what's my best course of action? A larger learning rate? A smaller error threshold? Thanks in advance.

I have not tried tuning the rprop parameters yet, so the solution might lie there.
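For reference, here is a hedged sketch (an editorial addition, not part of the original question) of the resilient-backpropagation arguments that neuralnet::neuralnet() exposes, shown with the package defaults; "model" and "trainingData" are the objects built in the code below. Tuning the rprop parameters would mean changing these values.

# Sketch only: neuralnet's rprop-related arguments with their default values.
net_rprop <- neuralnet::neuralnet(
  model, trainingData,
  hidden              = 2,
  algorithm           = "rprop+",                       # resilient backprop with weight backtracking
  learningrate.factor = list(minus = 0.5, plus = 1.2),  # shrink/grow factors for the update values
  learningrate.limit  = NULL,                           # optional lower/upper bounds on the update values
  threshold           = 0.01                            # stop when the error's partial derivatives fall below this
)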

Code:

# DATA ----------------------
# One day of readings at 10-minute intervals: a 17-degree baseline with a
# 20-degree plateau between roughly 07:00 and 20:00, replicated n times.
minute <- seq(0, 6*24 - 1)
temp <- rep.int(17, 6*24)
temp[(6*7):(6*20)] <- 20
n <- 10
dta <- data.frame(Zeit = minute, Status = temp)
dta <- dta[rep(seq_len(nrow(dta)), n), ]
# Scale everything
maxs <- apply(dta, 2, max) 
mins <- apply(dta, 2, min)

nnInput <- data.frame(Zeit = dta$Zeit, Status = dta$Status)
nnInput <- as.data.frame(scale(nnInput, center = mins, scale = maxs - mins))
# Split by alternating rows: odd rows for training, even rows for testing
trainingData <- nnInput[seq(1, nrow(nnInput), 2), ]
testData     <- nnInput[seq(2, nrow(nnInput), 2), ]

# MODEL ---------------------
model <- as.formula("Status ~ Zeit")
net <- neuralnet::neuralnet(model, 
                            trainingData, 
                            hidden = 2, 
                            threshold = 0.01,
                            linear.output = TRUE,
                            lifesign = "full",
                            stepmax = 100000,
                            rep = 1)

# Predict on the test inputs, then undo the min-max scaling
net.results <- neuralnet::compute(net, testData$Zeit)

results <- net.results$net.result * (maxs["Status"] - mins["Status"]) + mins["Status"]
testData <- as.data.frame(t(t(testData) * (maxs - mins) + mins))

cleanOutput <- data.frame(Actual = testData$Status, 
                          Prediction = results, 
                          diff = abs(results - testData$Status))

summary(cleanOutput)

plot(cleanOutput$Actual[1:144], main = "Zeittabelle", xlab = paste("Min. seit 0:00 *", n), ylab = "Temperatur")
lines(cleanOutput$Prediction[1:144], col = "red", lwd = 3)
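Since the start weights are the only random element, one way to pin them down (an editorial sketch, not from the original question) is to fix the RNG seed before the call; neuralnet also accepts an explicit startweights vector.

# Sketch: controlling the random start weights (assumes the code above has run).
set.seed(42)                          # makes the random start weights reproducible
net_seeded <- neuralnet::neuralnet(model, trainingData, hidden = 2,
                                   threshold = 0.01, linear.output = TRUE)

# Alternatively, supply the start weights yourself; the vector length must match
# the number of weights in the network (7 for this 1-2-1 architecture).
net_manual <- neuralnet::neuralnet(model, trainingData, hidden = 2,
                                   startweights = runif(7, -1, 1))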
asked Jul 26 '16 by sebastianmm

People also ask

What is the reason for randomly initializing weights in a neural network?

The weights of artificial neural networks must be initialized to small random numbers, because that is what the stochastic optimization algorithm used to train the model, stochastic gradient descent, expects.

What happens if the weights initialized randomly can be very high or very low?

Random initialization is the better choice because it breaks the symmetry. However, initializing the weights with very high or very low values can result in slower optimization.

Is random weight assignment better than assigning weights to the units in the hidden layer?

No matter what the input is, if all weights are the same, all units in the hidden layer will be the same too. This is the core symmetry problem and the reason why you should initialize the weights randomly (or at least with different values). Note that this issue affects all architectures that use fully connected (each-to-each) layers.

Why don't we just initialize all weights in a neural network to zero?

If all the weights are initialized to zero, the derivatives remain the same for every w in W[l]. As a result, the neurons learn the same features in every iteration. This is known as the network failing to break symmetry. And not just zero: any constant initialization produces a poor result.


2 Answers

Basically, initialization is really important. If you don't initialize the weights randomly, you can end up with a network that does not work at all (e.g. by setting all the weights to 0). It has also been shown that for sigmoid and ReLU activations, certain kinds of weight initialization help in training your network.
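(To make that last point concrete, here is a rough sketch of the initialization schemes usually meant for those activations: Glorot/Xavier for sigmoid-like units and He for ReLU. This is an editorial illustration with made-up layer sizes, not something neuralnet does for you.)

# Illustrative only: Glorot/Xavier and He initialization for a layer with
# n_in inputs and n_out outputs (the layer sizes are arbitrary examples).
n_in  <- 64
n_out <- 32
limit    <- sqrt(6 / (n_in + n_out))                          # Glorot uniform bound
W_xavier <- matrix(runif(n_in * n_out, -limit, limit), nrow = n_out)
W_he     <- matrix(rnorm(n_in * n_out, sd = sqrt(2 / n_in)), nrow = n_out)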

But in your case, I think the differences are mostly driven by the complexity of your problem. A model whose complexity matches the complexity of the problem performs well. The other models may suffer for the following reasons:

  1. Too little complexity: with a single node you may simply be unable to learn the proper function.
  2. Too much complexity: with a two-layer network you might get stuck in a local minimum. Increasing the number of parameters of your network also increases the size of the parameter space. On the one hand you might get a better model; on the other hand you may land in a region of the parameter space that results in a poor solution. Training the same model several times with different initializations and choosing the best one might overcome this issue (see the sketch below).
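A possible way to do that with neuralnet itself (an editorial sketch building on the question's code, not from the original answer): the rep argument trains several networks from different random start weights, result.matrix holds one column per repetition, and compute() lets you pick one.

# Sketch: multiple random restarts, keeping the repetition with the lowest error.
net <- neuralnet::neuralnet(model, trainingData,
                            hidden    = 1,
                            threshold = 0.01,
                            stepmax   = 100000,
                            rep       = 10)               # 10 different random initializations

best <- which.min(net$result.matrix["error", ])           # repetition with the smallest SSE
net.results <- neuralnet::compute(net, testData$Zeit, rep = best)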

UPDATE:

  1. With small network sizes it is quite common to get stuck in a local minimum. Depending on how much time you can spend training your network, you may use the following techniques to overcome that:

    • Dropout / batch normalization / batch-order randomization: if you can afford to train your network a little longer, you can exploit the randomization properties of dropout or batch normalization. Thanks to these random fluctuations you are able to escape poor local minima (which are usually believed to be relatively shallow).
    • Cross-validation / multiple runs: when you start your training multiple times, the probability that you will end up in a poor minimum decreases significantly.
  2. Regarding the connection between layer size and polynomial degree: I think the question is not clearly stated. You would have to specify more details, e.g. the activation function. I also think that the nature of polynomials differs a lot from the functions that can be modelled by a classic neural network. For polynomials, a small change in parameter values usually leads to a much larger change in the output than in the neural network case. The derivative of a sigmoid-activated neural network is usually a bounded function, whereas the derivative of a polynomial is unbounded whenever the degree is greater than 1 (see the short illustration below). Because of this, I think that looking for a correspondence between a polynomial degree and the size of a hidden layer is not worth serious consideration.
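A short numeric illustration of the boundedness claim (an editorial addition, not part of the original answer):

# The logistic sigmoid's derivative never exceeds 0.25, regardless of the input
# range; even a degree-2 polynomial's derivative grows without bound.
sigmoid  <- function(x) 1 / (1 + exp(-x))
dsigmoid <- function(x) sigmoid(x) * (1 - sigmoid(x))

x <- seq(-1000, 1000, by = 0.1)
max(dsigmoid(x))        # ~0.25
max(abs(2 * x))         # 2000 -- derivative of x^2, keeps growing with the range of x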

answered by Marcin Możejko


  • All you need is a good init (Mishkin & Matas, 2016): this paper proposes a simple method for weight initialization for deep network learning (http://arxiv.org/abs/1511.06422)

  • Watch this 6-minute video by Andrew Ng (Machine Learning, Coursera -> Week 5 -> Random Initialization), which explains the danger of setting all initial weights to zero in backpropagation (https://www.coursera.org/learn/machine-learning/lecture/ND5G5/random-initialization)

Suppose we initialize all weights to the same value (e.g. zero or one). In this case, each hidden unit gets exactly the same signal. For example, if all weights are initialized to 1, each unit receives a signal equal to the sum of the inputs (and outputs sigmoid(sum(inputs))). If all weights are zero, which is even worse, every hidden unit gets a zero signal. No matter what the input is, if all weights are the same, all units in the hidden layer will be the same too. This is why one should initialize the weights randomly.
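A tiny forward-pass sketch of that symmetry argument (an editorial illustration with made-up numbers, not from the original answer):

# With identical weights every hidden unit computes the same activation;
# with small random weights the activations differ, so learning can too.
sigmoid <- function(x) 1 / (1 + exp(-x))
x <- c(0.3, -1.2, 0.7)                               # one example with 3 input features

W_same <- matrix(1, nrow = 4, ncol = 3)              # 4 hidden units, all weights = 1
W_rand <- matrix(rnorm(12, sd = 0.1), nrow = 4, ncol = 3)

sigmoid(W_same %*% x)    # all four hidden activations are identical
sigmoid(W_rand %*% x)    # four different activations -> symmetry broken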

answered by Sayali Sonawane