How does choosing between pre and post zero padding of sequences impact results

I'm working on an NLP sequence labelling problem. My data consists of variable length sequences (w_1, w_2, ..., w_k) with corresponding labels (l_1, l_2, ..., l_k) (in this case the task is named entity extraction).

I intend to solve the problem with recurrent neural networks. Since the sequences are of variable length, I need to pad them (I want a batch size > 1). I can either pre-pad them with zeros or post-pad them with zeros, i.e. make every sequence either (0, 0, ..., w_1, w_2, ..., w_k) or (w_1, w_2, ..., w_k, 0, 0, ..., 0), so that every sequence has the same length.
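For concreteness, here is a minimal sketch of the two options using Keras' pad_sequences (the token ids are made up; 0 is the padding value in both cases):

```python
# Minimal illustration of pre- vs post-padding with Keras (token ids are made up).
from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[7, 2], [5, 9, 4, 1]]

print(pad_sequences(seqs, padding='pre'))
# [[0 0 7 2]
#  [5 9 4 1]]

print(pad_sequences(seqs, padding='post'))
# [[7 2 0 0]
#  [5 9 4 1]]
```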

How does the choice between pre- and post-padding impact results?

It seems like pre-padding is more common, but I can't find an explanation of why it would be better. Given that RNNs share weights across time steps, the choice feels arbitrary to me.

asked Sep 19 '17 by langkilde

People also ask

Does padding affect performance?

Results show that padding affects model performance even when convolutional layers are involved.

Does padding affect accuracy?

Padding influences accuracy. To handle its adverse effect, you can define new metrics that ignore the class associated with padding.

What is meant by zero padding Why do we use it?

Zero padding is a technique typically employed to make the size of the input sequence equal to a power of two. In zero padding, you add zeros to the end of the input sequence so that the total number of samples is equal to the next higher power of two.

Why do we need sequence padding?

Padding is a special form of masking where the masked steps are at the start or the end of a sequence. Padding comes from the need to encode sequence data into contiguous batches: in order to make all sequences in a batch fit a given standard length, it is necessary to pad or truncate some sequences.
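To illustrate that point, here is a minimal sketch of padding plus masking in Keras (assuming TensorFlow 2.x; the shapes and values are made up): the Masking layer flags the zero-padded timesteps so the downstream LSTM skips them.

```python
# Sketch: padded float sequences plus an explicit Masking layer, so the LSTM
# ignores timesteps whose features all equal the padding value (0.0).
import numpy as np
import tensorflow as tf

batch = np.array([
    [[0.2, 0.1], [0.5, 0.3], [0.0, 0.0]],   # real, real, padding
    [[0.7, 0.9], [0.0, 0.0], [0.0, 0.0]],   # real, padding, padding
], dtype="float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(3, 2)),
    tf.keras.layers.Masking(mask_value=0.0),
    tf.keras.layers.LSTM(4),     # the final state comes from the last unmasked timestep
])
print(model(batch).shape)        # (2, 4)
```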


2 Answers

Commonly in RNNs, we take the final output or hidden state and use it to make a prediction (or to do whatever task we are trying to do).

If we send a bunch of zeros to the RNN before taking the final output (i.e. post-padding, as you describe), then the hidden state of the network at the final word in the sentence would likely get 'flushed out' to some extent by all the zero inputs that come after that word.

So intuitively, this might be why pre-padding is more popular/effective.
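A quick way to see this intuition (just a sketch with an untrained Keras SimpleRNN and random inputs, not part of the original answer) is to compare the final output of the same sequence with no padding, with post-padding, and with pre-padding, without any masking:

```python
# Sketch: an untrained SimpleRNN with random weights; no masking is used, so the
# RNN really does process the zero timesteps.
import numpy as np
import tensorflow as tf

rnn = tf.keras.layers.SimpleRNN(4)                   # returns only the last output

seq = np.random.rand(1, 3, 2).astype("float32")      # one sequence, 3 real timesteps
pad = np.zeros((1, 2, 2), dtype="float32")           # 2 zero-padded timesteps

h_plain = rnn(seq)                                   # state right after the last real word
h_post  = rnn(np.concatenate([seq, pad], axis=1))    # state keeps updating on the zeros
h_pre   = rnn(np.concatenate([pad, seq], axis=1))    # last real word is still the last timestep

# With post-padding, the vector you would feed to a classifier is no longer the
# state at the final word; with pre-padding it still is (just computed from a
# slightly different starting state).
print(np.abs((h_plain - h_post).numpy()).max(),
      np.abs((h_plain - h_pre).numpy()).max())
```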

answered Oct 22 '22 by nlml


This paper (https://arxiv.org/pdf/1903.07288.pdf) studied the effect of padding types on LSTM and CNN. They found that post-padding achieved substantially lower accuracy (nearly half) compared to pre-padding in LSTMs, although there wasn't a significant difference for CNNs (post-padding was only slightly worse).

A simple, intuitive explanation for RNNs is that post-padding adds noise to what has been learned from the sequence through time, and there are no further timesteps in which the RNN can recover from that noise. With pre-padding, on the other hand, the RNN can adjust to the added noise of the leading zeros as it processes the rest of the sequence.

I think the community needs more thorough experiments to give a more detailed mechanistic explanation of how padding affects performance.

I always recommend using pre-padding over post-padding, even for CNNs, unless the problem specifically requires post-padding.
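For the sequence-labelling setup in the original question, a pre-padded pipeline might look like the following sketch (hypothetical token and tag ids, assuming TensorFlow 2.x); setting mask_zero=True additionally lets the LSTM and the loss ignore the padded positions:

```python
# Hypothetical end-to-end sketch for per-token tagging with pre-padding.
# Token id 0 is reserved for padding; the ids and tags below are made up.
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

X = [[4, 8, 15], [16, 23, 42, 7]]     # integer-encoded tokens
y = [[1, 0, 2], [0, 0, 1, 2]]         # per-token tag ids (3 classes)

X_pad = pad_sequences(X, padding='pre')
y_pad = pad_sequences(y, padding='pre')

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),
    tf.keras.layers.Embedding(input_dim=50, output_dim=16, mask_zero=True),
    tf.keras.layers.LSTM(32, return_sequences=True),   # one output per timestep
    tf.keras.layers.Dense(3, activation='softmax'),    # tag distribution per token
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(X_pad, y_pad, epochs=1, verbose=0)  # padded steps are masked out via the propagated mask
```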

answered Oct 22 '22 by JafetGado