
In language modeling, why do I have to init_hidden weights before every new epoch of training? (pytorch)

I have a question about the following code for PyTorch language modeling:

print("Training and generating...")
    for epoch in range(1, config.num_epochs + 1): 
        total_loss = 0.0
        model.train()  
        hidden = model.init_hidden(config.batch_size)  

        for ibatch, i in enumerate(range(0, train_len - 1, seq_len)):
            data, targets = get_batch(train_data, i, seq_len)          
            hidden = repackage_hidden(hidden)
            model.zero_grad()

            output, hidden = model(data, hidden)
            loss = criterion(output.view(-1, config.vocab_size), targets)
            loss.backward()  

Please check line 5, the call to model.init_hidden(config.batch_size).

And the init_hidden function is as follows:

def init_hidden(self, bsz):
    weight = next(self.parameters()).data
    if self.rnn_type == 'LSTM':  # an LSTM needs both states: (h0, c0)
        return (Variable(weight.new(self.n_layers, bsz, self.hi_dim).zero_()),
                Variable(weight.new(self.n_layers, bsz, self.hi_dim).zero_()))
    else:  # GRU and vanilla RNN need only h0
        return Variable(weight.new(self.n_layers, bsz, self.hi_dim).zero_())
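
repackage_hidden is not shown in the question; in PyTorch's word_language_model example it is essentially the following. This is a sketch assuming current PyTorch, where detach() replaces the old Variable machinery:

    import torch

    def repackage_hidden(h):
        # Detach the state from the graph of the previous batch so that
        # backpropagation stops at the batch boundary (truncated BPTT),
        # while the state's values still carry over to the next batch.
        if isinstance(h, torch.Tensor):
            return h.detach()
        else:  # an LSTM state is a tuple (h, c)
            return tuple(repackage_hidden(v) for v in h)

The contrast matters for the question: repackage_hidden keeps the state's values and only cuts the gradient history, while init_hidden throws the values away and starts from zero.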

My question is:

Why do we need to call init_hidden at every epoch? Shouldn't the model inherit the hidden state from the last epoch and continue training from it?

asked Mar 26 '19 by Adrian


People also ask

How does PyTorch RNN work?

An RNN's activation function defines how the weighted sum of inputs at a node is transformed into that node's output at each time step. In PyTorch this is selected when the model is constructed, as in the sketch below.
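
A minimal sketch, assuming a recent PyTorch and made-up tensor sizes:

    import torch
    import torch.nn as nn

    # Single-layer RNN: 10-dim inputs, 20-dim hidden state. The
    # nonlinearity argument selects the activation applied at each
    # time step ('tanh' by default, or 'relu').
    rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=1, nonlinearity='tanh')

    x = torch.randn(5, 3, 10)   # (seq_len, batch, input_size)
    h0 = torch.zeros(1, 3, 20)  # (num_layers, batch, hidden_size)
    output, hn = rnn(x, h0)     # output: (5, 3, 20); hn: (1, 3, 20)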

How do you train a recurrent neural network?

A recurrent neural network is trained with an application of back-propagation called back-propagation through time (BPTT): the network is unrolled over the time steps of a sequence and the gradient is propagated back through each step. These gradient values can shrink exponentially as they propagate through the time steps (the vanishing-gradient problem). As usual, the gradient is used to adjust the network's weights, which is how it learns.
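
A sketch of what BPTT looks like in practice, with hypothetical sizes and a made-up regression loss:

    import torch
    import torch.nn as nn

    rnn = nn.RNN(input_size=10, hidden_size=20)
    decoder = nn.Linear(20, 10)          # maps the hidden state back to output space
    criterion = nn.MSELoss()

    x = torch.randn(5, 3, 10)            # 5 time steps, batch of 3
    targets = torch.randn(5, 3, 10)

    h0 = torch.zeros(1, 3, 20)
    output, hn = rnn(x, h0)              # forward through all 5 time steps
    loss = criterion(decoder(output), targets)
    loss.backward()                      # BPTT: gradients flow back through every time step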


2 Answers

The answer lies in init_hidden. It is not the hidden-layer weights but the initial hidden state of the RNN/LSTM, the h0 in the formulas. For every epoch we re-initialize a fresh hidden state because, at test time, the model will have no information about the test sentence and will start from a zero initial hidden state; training should match that.
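
For what it's worth, in current PyTorch the Variable wrapper is no longer needed and the same zero state can be written with new_zeros. A sketch of the asker's function under that assumption:

    def init_hidden(self, bsz):
        # h0 (and c0 for an LSTM) start at zero: the model knows nothing
        # about the upcoming sequence, matching what it sees at test time.
        weight = next(self.parameters())
        if self.rnn_type == 'LSTM':
            return (weight.new_zeros(self.n_layers, bsz, self.hi_dim),
                    weight.new_zeros(self.n_layers, bsz, self.hi_dim))
        else:
            return weight.new_zeros(self.n_layers, bsz, self.hi_dim)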

answered Oct 17 '22 by Adrian


The hidden state stores the internal state of the RNN from predictions made on previous tokens in the current sequence; this is what allows RNNs to understand context. The hidden state after each token is determined by the output for that token.

When you predict the first token of a new sequence, retaining the hidden state from the previous sequence would make the model behave as if the new sequence were a continuation of the old one, which gives worse results. Instead, for the first token you initialise an empty (zero) hidden state, which is then filled by the model's output and used for the second token, as in the sketch below.
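
Concretely, reusing the names from the question's loop (seq_a and seq_b are hypothetical batches from two unrelated sequences):

    hidden = model.init_hidden(config.batch_size)  # zero state: a brand-new context
    output, hidden = model(seq_a, hidden)          # hidden now summarizes seq_a

    # Reset before an unrelated sequence; otherwise seq_b would be
    # read as a continuation of seq_a.
    hidden = model.init_hidden(config.batch_size)
    output, hidden = model(seq_b, hidden)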

Think about it this way: it is like being asked to classify a sentence after being handed the US constitution (irrelevant information), versus being given some background context about the sentence and then being asked to classify it.

answered Oct 17 '22 by Teymour Aldridge