Is it possible to use Google BERT to calculate similarity between two textual documents?

Is it possible to use Google BERT to calculate the similarity between two textual documents? As I understand it, BERT's input is supposed to be sentences of limited size. Some works use BERT to compute similarity between sentences, for example:

https://github.com/AndriyMulyar/semantic-text-similarity

https://github.com/beekbin/bert-cosine-sim

Is there an implementation of BERT that takes large documents (documents with thousands of words) as input instead of sentences?

asked Sep 11 '19 by Youcef

People also ask

How do you find the similarity between two text files?

The simplest way to compute the similarity between two documents using word embeddings is to compute the document centroid vector. This is the vector that's the average of all the word vectors in the document.
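As a rough illustration, here is a minimal sketch of that centroid approach, assuming gensim and its downloadable glove-wiki-gigaword-50 vectors (the documents and tokenization are placeholders, not part of the original answer):

```python
# Minimal sketch: centroid-of-word-embeddings similarity.
# Assumes gensim and its pretrained "glove-wiki-gigaword-50" vectors;
# the documents and the naive tokenization are placeholders.
import re
import numpy as np
import gensim.downloader

vectors = gensim.downloader.load("glove-wiki-gigaword-50")

def centroid(doc):
    """Average the vectors of all in-vocabulary words in the document."""
    words = [w for w in re.findall(r"[a-z]+", doc.lower()) if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc1 = "The cat sat on the mat."
doc2 = "A kitten was sitting on a rug."
print(cosine(centroid(doc1), centroid(doc2)))
```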

How can we measure document similarity?

Jaccard distance: the Jaccard index is used to calculate the similarity between two finite sets, and the Jaccard distance can be taken as 1 minus the Jaccard index. We can also use cosine or Euclidean distance if we can represent the documents in a vector space.
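For concreteness, a small sketch of the Jaccard index and distance over word sets, with a deliberately naive tokenization (illustrative only):

```python
# Minimal sketch: Jaccard similarity/distance over word sets.
import re

def tokens(doc):
    return set(re.findall(r"[a-z]+", doc.lower()))

def jaccard_index(a, b):
    sa, sb = tokens(a), tokens(b)
    return len(sa & sb) / len(sa | sb)

def jaccard_distance(a, b):
    return 1.0 - jaccard_index(a, b)

print(jaccard_distance("the cat sat on the mat", "a cat sat on a mat"))
```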


1 Answer

BERT is a sentence representation model. It is trained to predict words in a sentence and to decide whether two sentences follow each other in a document, i.e., it operates strictly at the sentence level. Moreover, BERT requires memory that is quadratic in the input length, which would not be feasible for whole documents.

It is quite common practice to average word embeddings to get a sentence representation. You can try the same thing with BERT: average the [CLS] vectors from BERT over the sentences in a document, as in the sketch below.
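A minimal sketch of that idea, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (the period-based sentence split is only a placeholder; the answer does not prescribe a particular implementation):

```python
# Minimal sketch: document vector = average of per-sentence [CLS] vectors.
# Assumes Hugging Face "transformers" and bert-base-uncased;
# the sentence splitting here is deliberately naive.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def document_vector(document):
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    cls_vectors = []
    with torch.no_grad():
        for sentence in sentences:
            inputs = tokenizer(sentence, return_tensors="pt",
                               truncation=True, max_length=512)
            outputs = model(**inputs)
            # [CLS] is the first token of the last hidden layer.
            cls_vectors.append(outputs.last_hidden_state[:, 0, :].squeeze(0))
    return torch.stack(cls_vectors).mean(dim=0)

v1 = document_vector("First document. It has several sentences.")
v2 = document_vector("Second document. Also more than one sentence.")
print(float(torch.nn.functional.cosine_similarity(v1, v2, dim=0)))
```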

There are also some document-level embeddings; for instance, doc2vec is a commonly used option.
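A minimal doc2vec sketch using gensim (the corpus and hyperparameters are placeholders, not recommendations from the answer):

```python
# Minimal sketch: training a small doc2vec model with gensim.
# The corpus and hyperparameters are placeholders only.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "the cat sat on the mat",
    "a kitten was sitting on a rug",
    "stock markets fell sharply today",
]
tagged = [TaggedDocument(words=doc.split(), tags=[i])
          for i, doc in enumerate(corpus)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# Infer vectors for unseen documents, then compare them.
v1 = model.infer_vector("the cat sat on a rug".split())
v2 = model.infer_vector("markets dropped today".split())
print(float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))))
```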

As far as I know, at the document level, frequency-based vectors such as tf-idf (with a good implementation in scikit-learn) are still close to the state of the art, so I would not hesitate to use them. At the very least, they are worth trying to see how they compare to embeddings.
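A minimal tf-idf baseline with scikit-learn, using default parameters and illustrative documents:

```python
# Minimal sketch: tf-idf document similarity with scikit-learn defaults.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "the cat sat on the mat",
    "a kitten was sitting on a rug",
    "stock markets fell sharply today",
]

tfidf = TfidfVectorizer().fit_transform(documents)
# Pairwise cosine similarities between all documents.
print(cosine_similarity(tfidf))
```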

answered Sep 22 '22 by Jindřich