
In language modeling, why do I have to init_hidden weights before every new epoch of training? (pytorch)

I have a question about the following code for PyTorch language modeling:

print("Training and generating...")
    for epoch in range(1, config.num_epochs + 1): 
        total_loss = 0.0
        model.train()  
        hidden = model.init_hidden(config.batch_size)  

        for ibatch, i in enumerate(range(0, train_len - 1, seq_len)):
            data, targets = get_batch(train_data, i, seq_len)          
            hidden = repackage_hidden(hidden)
            model.zero_grad()

            output, hidden = model(data, hidden)
            loss = criterion(output.view(-1, config.vocab_size), targets)
            loss.backward()  

Please check line 5, the call to model.init_hidden(config.batch_size).

And the init_hidden function is as follows:

def init_hidden(self, bsz):
    weight = next(self.parameters()).data
    if self.rnn_type == 'LSTM':  # an LSTM needs both states: (h0, c0)
        return (Variable(weight.new(self.n_layers, bsz, self.hi_dim).zero_()),
                Variable(weight.new(self.n_layers, bsz, self.hi_dim).zero_()))
    else:  # GRU and vanilla RNN need only h0
        return Variable(weight.new(self.n_layers, bsz, self.hi_dim).zero_())
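
repackage_hidden is not shown in the question; in PyTorch's word_language_model example it is essentially the following. This is a sketch assuming current PyTorch, where detach() replaces the old Variable machinery:

    import torch

    def repackage_hidden(h):
        # Detach the state from the graph of the previous batch so that
        # backpropagation stops at the batch boundary (truncated BPTT),
        # while the state's values still carry over to the next batch.
        if isinstance(h, torch.Tensor):
            return h.detach()
        else:  # an LSTM state is a tuple (h, c)
            return tuple(repackage_hidden(v) for v in h)

The contrast matters for the question: repackage_hidden keeps the state's values and only cuts the gradient history, while init_hidden throws the values away and starts from zero.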

My question is:

Why do we need to call init_hidden at every epoch? Shouldn't the model inherit the hidden state from the last epoch and continue training from it?

asked Mar 26 '19 by Adrian


People also ask

How does PyTorch RNN work?

An RNN's activation function defines how the weighted sum of inputs at a node is transformed into that node's output at each time step. In PyTorch this is selected when the model is constructed, as in the sketch below.
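
A minimal sketch, assuming a recent PyTorch and made-up tensor sizes:

    import torch
    import torch.nn as nn

    # Single-layer RNN: 10-dim inputs, 20-dim hidden state. The
    # nonlinearity argument selects the activation applied at each
    # time step ('tanh' by default, or 'relu').
    rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=1, nonlinearity='tanh')

    x = torch.randn(5, 3, 10)   # (seq_len, batch, input_size)
    h0 = torch.zeros(1, 3, 20)  # (num_layers, batch, hidden_size)
    output, hn = rnn(x, h0)     # output: (5, 3, 20); hn: (1, 3, 20)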

How do you train a recurrent neural network?

A recurrent neural network is trained with an application of back-propagation called back-propagation through time (BPTT): the network is unrolled over the time steps of a sequence and the gradient is propagated back through each step. These gradient values can shrink exponentially as they propagate through the time steps (the vanishing-gradient problem). As usual, the gradient is used to adjust the network's weights, which is how it learns.
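
A sketch of what BPTT looks like in practice, with hypothetical sizes and a made-up regression loss:

    import torch
    import torch.nn as nn

    rnn = nn.RNN(input_size=10, hidden_size=20)
    decoder = nn.Linear(20, 10)          # maps the hidden state back to output space
    criterion = nn.MSELoss()

    x = torch.randn(5, 3, 10)            # 5 time steps, batch of 3
    targets = torch.randn(5, 3, 10)

    h0 = torch.zeros(1, 3, 20)
    output, hn = rnn(x, h0)              # forward through all 5 time steps
    loss = criterion(decoder(output), targets)
    loss.backward()                      # BPTT: gradients flow back through every time step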


2 Answers

The answer lies in init_hidden. It is not the hidden-layer weights but the initial hidden state of the RNN/LSTM, the h0 in the formulas. For every epoch we re-initialize a fresh hidden state because, at test time, the model will have no information about the test sentence and will start from a zero initial hidden state; training should match that.
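
For what it's worth, in current PyTorch the Variable wrapper is no longer needed and the same zero state can be written with new_zeros. A sketch of the asker's function under that assumption:

    def init_hidden(self, bsz):
        # h0 (and c0 for an LSTM) start at zero: the model knows nothing
        # about the upcoming sequence, matching what it sees at test time.
        weight = next(self.parameters())
        if self.rnn_type == 'LSTM':
            return (weight.new_zeros(self.n_layers, bsz, self.hi_dim),
                    weight.new_zeros(self.n_layers, bsz, self.hi_dim))
        else:
            return weight.new_zeros(self.n_layers, bsz, self.hi_dim)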

answered Oct 17 '22 by Adrian


The hidden state stores the internal state of the RNN from predictions made on previous tokens in the current sequence; this is what allows RNNs to understand context. The hidden state after each token is determined by the output for that token.

When you predict the first token of a new sequence, retaining the hidden state from the previous sequence would make the model behave as if the new sequence were a continuation of the old one, which gives worse results. Instead, for the first token you initialise an empty (zero) hidden state, which is then filled by the model's output and used for the second token, as in the sketch below.
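
Concretely, reusing the names from the question's loop (seq_a and seq_b are hypothetical batches from two unrelated sequences):

    hidden = model.init_hidden(config.batch_size)  # zero state: a brand-new context
    output, hidden = model(seq_a, hidden)          # hidden now summarizes seq_a

    # Reset before an unrelated sequence; otherwise seq_b would be
    # read as a continuation of seq_a.
    hidden = model.init_hidden(config.batch_size)
    output, hidden = model(seq_b, hidden)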

Think about it this way: it is like being asked to classify a sentence after being handed the US constitution (irrelevant information), versus being given some background context about the sentence and then being asked to classify it.

answered Oct 17 '22 by Teymour Aldridge