Is it possible to use Google BERT to calculate the similarity between two textual documents? As I understand it, BERT's input is limited to fairly short sequences. Some projects use BERT for sentence-level similarity, for example:
https://github.com/AndriyMulyar/semantic-text-similarity
https://github.com/beekbin/bert-cosine-sim
Is there an implementation of BERT that accepts large documents (documents with thousands of words) as input instead of sentences?
The simplest way to compute the similarity between two documents using word embeddings is to use the document centroid vector: simply the average of all the word vectors in the document.
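A minimal sketch of the centroid approach with numpy. The tiny word-vector table here is purely illustrative; in practice you would load pretrained embeddings (e.g. word2vec or GloVe via gensim):

```python
import numpy as np

# Toy embedding table for illustration; in real use, load pretrained
# vectors, e.g. gensim's KeyedVectors over word2vec or GloVe.
word_vectors = {
    "cats":   np.array([0.9, 0.1, 0.0]),
    "dogs":   np.array([0.8, 0.2, 0.0]),
    "stocks": np.array([0.0, 0.1, 0.9]),
}

def centroid(tokens, dim=3):
    """Average the vectors of all in-vocabulary tokens."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc1 = "cats and dogs".split()
doc2 = "dogs chasing cats".split()
print(cosine(centroid(doc1), centroid(doc2)))  # close to 1.0
```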
Another option is the Jaccard index, which measures the similarity between two finite sets (here, the sets of tokens in each document); the Jaccard distance is simply 1 minus the Jaccard index. If we can represent the documents in a vector space, we can instead use cosine or Euclidean distance.
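For example, a set-based Jaccard distance over unique tokens can be computed in a few lines (simple whitespace tokenization is assumed here):

```python
def jaccard_similarity(doc_a: str, doc_b: str) -> float:
    """Jaccard index over the sets of unique tokens in each document."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    if not a and not b:
        return 1.0  # two empty documents are trivially identical
    return len(a & b) / len(a | b)

def jaccard_distance(doc_a: str, doc_b: str) -> float:
    return 1.0 - jaccard_similarity(doc_a, doc_b)

print(jaccard_distance("the cat sat", "the cat ran"))  # 0.5
```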
BERT is a sentence representation model. It is trained to predict masked words in a sentence and to decide whether two sentences follow each other in a document, i.e., strictly at the sentence level. Moreover, BERT requires memory quadratic in the input length, which would not be feasible for long documents.
It is quite common practice to average word embeddings to get a sentence representation. You can try the same trick with BERT: split the document into sentences and average the [CLS] vectors BERT produces for each sentence.
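A sketch of this idea using the Hugging Face transformers library (the library and model name are my assumptions; the answer does not prescribe a specific implementation):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def document_embedding(sentences):
    """Encode each sentence separately and average the [CLS] vectors."""
    cls_vectors = []
    with torch.no_grad():
        for sent in sentences:
            inputs = tokenizer(sent, return_tensors="pt",
                               truncation=True, max_length=512)
            outputs = model(**inputs)
            # [CLS] is the first token of the last hidden layer
            cls_vectors.append(outputs.last_hidden_state[0, 0])
    return torch.stack(cls_vectors).mean(dim=0)

doc1 = ["The cat sat on the mat.", "It purred happily."]
doc2 = ["A kitten rested on the rug.", "It seemed content."]
emb1, emb2 = document_embedding(doc1), document_embedding(doc2)
print(float(torch.nn.functional.cosine_similarity(emb1, emb2, dim=0)))
```

Note that raw [CLS] vectors from a vanilla, non-fine-tuned BERT are often mediocre similarity features; models fine-tuned for similarity (like those in the repositories linked in the question) usually work better.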
There are also document-level embeddings; for instance, doc2vec is a commonly used option.
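A short doc2vec sketch using gensim (the toy corpus is illustrative; meaningful vectors require a much larger training corpus):

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; real use needs many documents for meaningful vectors.
corpus = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stocks fell sharply on monday",
]
tagged = [TaggedDocument(words=doc.split(), tags=[i])
          for i, doc in enumerate(corpus)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# Infer vectors for unseen documents and compare them with cosine.
v1 = model.infer_vector("a cat sleeps on a mat".split())
v2 = model.infer_vector("markets dropped this week".split())
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```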
As far as I know, at the document level, frequency-based vectors such as tf-idf (with a good implementation in scikit-learn) are still close to the state of the art, so I would not hesitate to use them. At the very least, tf-idf is worth trying as a baseline to see how it compares to embeddings.
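For instance, with scikit-learn's TfidfVectorizer and cosine similarity (the example documents are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Document one: thousands of words about cats ...",
    "Document two: thousands of words about felines ...",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # sparse (n_docs, n_terms) matrix

# Pairwise cosine similarity between the two document vectors
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
```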