
Why take the first hidden state for sequence classification (DistilBertForSequenceClassification) in HuggingFace?

In the classification head of HuggingFace's DistilBertForSequenceClassification, the first hidden state along the sequence-length dimension of the transformer output is taken and used for classification.

hidden_state = distilbert_output[0]  # (bs, seq_len, dim) <-- transformer output
pooled_output = hidden_state[:, 0]  # (bs, dim)           <-- first hidden state
pooled_output = self.pre_classifier(pooled_output)  # (bs, dim)
pooled_output = nn.ReLU()(pooled_output)  # (bs, dim)
pooled_output = self.dropout(pooled_output)  # (bs, dim)
logits = self.classifier(pooled_output)  # (bs, num_labels)    <-- classification logits

Is there any benefit to taking the first hidden state rather than the last one, an average over the sequence, or even using a Flatten layer instead?

asked Feb 06 '20 by doe

People also ask

How many layers does a DistilBERT have?

6-layer, 768-hidden, 12-heads, 66M parameters. The multilingual DistilBERT model was distilled from the multilingual BERT checkpoint bert-base-multilingual-cased.
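
If you want to verify these numbers yourself, one minimal sketch (my own illustration, not from the page) is to inspect the default DistilBertConfig in the transformers library:

from transformers import DistilBertConfig

# The default configuration corresponds to distilbert-base-uncased
config = DistilBertConfig()
print(config.n_layers)  # 6   -- Transformer blocks
print(config.dim)       # 768 -- hidden size
print(config.n_heads)   # 12  -- attention heads per layer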

What is the output of DistilBERT?

Flowing through DistilBERT, the output is a vector for each input token; each vector is made up of 768 numbers (floats).
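
As a minimal sketch of that output shape (the model name and example sentence are placeholders of my choosing):

from transformers import DistilBertTokenizer, DistilBertModel
import torch

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("Hello, world!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per input token, including [CLS] and [SEP]
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])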

How does DistilBERT work?

DistilBERT uses a technique called knowledge distillation, which approximates Google's BERT, i.e. the large neural network, with a smaller one. The idea is that once a large neural network has been trained, its full output distributions can be approximated by a smaller network.
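
To make the idea concrete, here is a simplified sketch of a soft-target distillation loss in PyTorch. The temperature value and the KL-only objective are my own simplifications; DistilBERT's actual training objective also includes a masked-language-modelling loss and a cosine embedding loss:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then push the
    # student's distribution towards the teacher's via KL divergence.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2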

What is DistilBERT Pretrained?

DistilBERT was pretrained on the same data as BERT: BookCorpus, a dataset consisting of 11,038 unpublished books, and English Wikipedia (excluding lists, tables, and headers).


1 Answer

Yes, this is directly related to the way that BERT is trained. Specifically, I encourage you to have a look at the original BERT paper, in which the authors introduce the meaning of the [CLS] token:

[CLS] is a special symbol added in front of every input example [...].

Specifically, it is used for classification purposes, and is therefore the first and simplest choice for fine-tuning on classification tasks. What the code fragment you quoted is doing is basically just extracting the hidden state of this [CLS] token.
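
You can see that [CLS] really is the first token in the sequence by inspecting the tokenizer output; a small sketch of my own (the example sentence is arbitrary):

from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
ids = tokenizer("This movie was great!")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'this', 'movie', 'was', 'great', '!', '[SEP]']
# hidden_state[:, 0] therefore selects the representation of [CLS]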

Unfortunately, the DistilBERT documentation in HuggingFace's library does not refer to this explicitly; you rather have to check out their BERT documentation, where they also highlight some issues with the [CLS] token, analogous to your concerns:

Alongside MLM, BERT was trained using a next sentence prediction (NSP) objective using the [CLS] token as a sequence approximate. The user may use this token (the first token in a sequence built with special tokens) to get a sequence prediction rather than a token prediction. However, averaging over the sequence may yield better results than using the [CLS] token.
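
If you want to try the averaging alternative the documentation hints at, a common sketch is mean pooling over the non-padding tokens (the helper below is my own illustration, not part of the DistilBERT source):

import torch

def mean_pool(hidden_state, attention_mask):
    # hidden_state: (bs, seq_len, dim), attention_mask: (bs, seq_len)
    mask = attention_mask.unsqueeze(-1).float()   # (bs, seq_len, 1)
    summed = (hidden_state * mask).sum(dim=1)     # (bs, dim)
    counts = mask.sum(dim=1).clamp(min=1e-9)      # (bs, 1)
    return summed / counts                        # (bs, dim)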

answered Oct 07 '22 by dennlinger