
Removing the [SEP] token in BERT for text classification

Given a sentiment classification dataset, I want to fine-tune BERT.

As you know, BERT was pre-trained to predict, given the current sentence, whether another sentence follows it. To make the network aware of this, they insert a [CLS] token at the beginning of the first sentence, add a [SEP] token to separate the first sentence from the second, and finally append another [SEP] at the end of the second sentence (it's not clear to me why they append that token at the end).

Anyway, for text classification, what I noticed in some of the examples online (see BERT in Keras with TensorFlow Hub) is that they add a [CLS] token, then the sentence, and another [SEP] token at the end.
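For concreteness, here is a small sketch of the two input formats (my own illustration, using the Hugging Face transformers tokenizer, which is not the library the linked example uses):

```python
# Sketch: how a single sentence and a sentence pair get wrapped in [CLS]/[SEP].
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Single sentence: [CLS] <tokens> [SEP]
single = tokenizer("The movie was great.")
print(tokenizer.convert_ids_to_tokens(single["input_ids"]))
# ['[CLS]', 'the', 'movie', 'was', 'great', '.', '[SEP]']

# Sentence pair: [CLS] <first sentence> [SEP] <second sentence> [SEP]
pair = tokenizer("The movie was great.", "I would watch it again.")
print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))
# ['[CLS]', 'the', 'movie', 'was', 'great', '.', '[SEP]',
#  'i', 'would', 'watch', 'it', 'again', '.', '[SEP]']
```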

Whereas in other research works (e.g. Enriching Pre-trained Language Model with Entity Information for Relation Classification) they remove the last [SEP] token.

Why is it (or isn't it) beneficial to add the [SEP] token at the end of the input text when my task uses only a single sentence?

asked Jan 13 '20 by Minions


2 Answers

I'm not quite sure why BERT needs the separation token [SEP] at the end for single-sentence tasks, but my guess is that BERT is an autoencoding model that, as mentioned, was originally designed for language modelling and next-sentence prediction. BERT was therefore trained to always expect the [SEP] token, which means the token is baked into the knowledge BERT built up during pre-training.

Downstream tasks that came later, such as single-sentence use cases (e.g. text classification), turned out to work with BERT as well; however, the [SEP] token was kept as a relic so that BERT behaves as it did during pre-training, and thus it is still used even for these tasks.

BERT might learn faster if [SEP] is appended at the end of a single sentence, because it has encoded some knowledge in that token, namely that it marks the end of the input. Without it, BERT would still know where the sentence ends (thanks to the padding tokens), which explains why the aforementioned research leaves the token out. But doing so might slow down training slightly, since BERT can presumably learn faster with the [SEP] token appended, especially when a truncated input contains no padding tokens at all.
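To illustrate the two variants being compared, here is a minimal sketch (my own, again assuming the Hugging Face transformers tokenizer): the standard encoding with a trailing [SEP], versus a manually built encoding that drops it so only padding marks the end of the sentence.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "the service was very slow"
max_length = 12

# Standard single-sentence encoding: [CLS] ... [SEP] [PAD] ...
with_sep = tokenizer(text, padding="max_length", max_length=max_length)
print(tokenizer.convert_ids_to_tokens(with_sep["input_ids"]))
# ['[CLS]', 'the', 'service', 'was', 'very', 'slow', '[SEP]', '[PAD]', ...]

# Variant without the trailing [SEP]: build the input ids by hand
ids = [tokenizer.cls_token_id] + tokenizer.encode(text, add_special_tokens=False)
ids += [tokenizer.pad_token_id] * (max_length - len(ids))
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'the', 'service', 'was', 'very', 'slow', '[PAD]', '[PAD]', ...]
```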

answered Oct 30 '22 by MJimitater


As mentioned in BERT's paper, BERT is pre-trained with two unsupervised prediction tasks: Masked Language Modeling and Next Sentence Prediction. In the Next Sentence Prediction task, the model takes a pair of sentences as input and learns to predict whether the second sentence is the next sentence in the original document or not.

Accordingly, I think the BERT model exploits the relationship between two text segments in text classification as well as in other tasks. This relationship can be used to predict whether the two sentences belong together or not. Therefore, the [SEP] token is needed to join the two sentences and let the model determine the relationship between them.
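As a sketch of that pair format (my own illustration, assuming the Hugging Face transformers tokenizer): the first [SEP] closes segment 0, while the second sentence and its final [SEP] form segment 1, which is what Next Sentence Prediction operates on.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

pair = tokenizer("He went to the store.", "He bought some milk.")
tokens = tokenizer.convert_ids_to_tokens(pair["input_ids"])
for token, segment in zip(tokens, pair["token_type_ids"]):
    print(f"{token:10s} segment {segment}")
# [CLS] ... first sentence ... [SEP]  -> segment 0
# second sentence ... [SEP]           -> segment 1
```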

answered Oct 30 '22 by Soroush Faridan