
Passing multiple sentences to BERT?

I have a dataset of paragraphs that I need to classify into two classes. These paragraphs are usually 3-5 sentences long, and the overwhelming majority of them are fewer than 500 words long. I would like to use BERT to tackle this problem.

I am wondering how I should use BERT to generate vector representations of these paragraphs and, in particular, whether it is fine to just pass the whole paragraph into BERT.

There have been informative discussions of related problems here and here. These discussions focus on how to use BERT for representing whole documents. In my case the paragraphs are not that long, and most could be passed to BERT without exceeding its maximum length of 512 tokens. However, BERT was trained on sentences, and sentences are relatively self-contained units of meaning. I wonder whether feeding multiple sentences into BERT conflicts fundamentally with what the model was designed to do (although this appears to be done regularly).

jhfodr76 asked Mar 02 '23 21:03



1 Answer

I think your question is based on a misconception. Even though the BERT paper uses the term "sentence" quite often, it is not referring to a linguistic sentence. The paper defines a sentence as:

an arbitrary span of contiguous text, rather than an actual linguistic sentence.

It is therefore completely fine to pass whole paragraphs to BERT. The model was trained on such multi-sentence spans, which is why it can handle them.
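To make this concrete, here is a minimal sketch of how a multi-sentence paragraph can be packed as one BERT input segment. The helper name `build_single_segment_input` is hypothetical, and the whitespace split is only a stand-in for BERT's WordPiece tokenizer (an assumption to keep the example self-contained). The point it illustrates is that the whole paragraph sits between one `[CLS]` and one `[SEP]`, with every position assigned to the same segment.

```python
from typing import Dict, List

MAX_LEN = 512  # BERT's maximum sequence length, counted in tokens (not words)


def build_single_segment_input(tokens: List[str], max_len: int = MAX_LEN) -> Dict[str, list]:
    """Pack one span of text (any number of sentences) as a single BERT segment.

    Hypothetical helper for illustration; a real tokenizer library does this for you.
    """
    # Reserve two positions for the special [CLS] and [SEP] tokens.
    body = tokens[: max_len - 2]
    packed = ["[CLS]"] + body + ["[SEP]"]
    return {
        "tokens": packed,
        # All zeros: the whole paragraph is one segment, however many sentences it has.
        "token_type_ids": [0] * len(packed),
        # All ones: every position holds a real token (no padding in this sketch).
        "attention_mask": [1] * len(packed),
    }


paragraph = "BERT reads spans of text. A span may hold several sentences. It need not be one."
# Stand-in tokenizer (assumption): real BERT uses WordPiece, not a whitespace split.
toy_tokens = paragraph.lower().split()
encoded = build_single_segment_input(toy_tokens)
print(len(encoded["tokens"]), encoded["tokens"][0], encoded["tokens"][-1])
```

Note that the 512 limit applies to tokens after subword tokenization, so a paragraph close to 500 words may still be truncated. It is worth checking token counts on your own data rather than relying on word counts.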

cronoik answered Mar 07 '23 01:03
