Is it possible to use Google BERT to calculate the similarity between two textual documents? As I understand it, BERT's input is limited to fairly short sequences. Some projects use BERT for sentence-level similarity, for example:
https://github.com/AndriyMulyar/semantic-text-similarity
https://github.com/beekbin/bert-cosine-sim
Is there an implementation of BERT that accepts large documents (documents with thousands of words) as input instead of sentences?
The simplest way to compute the similarity between two documents using word embeddings is to use the document centroid vector: simply the average of all the word vectors in the document.
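A minimal sketch of the centroid approach with numpy. The tiny word-vector table here is purely illustrative; in practice you would load pretrained embeddings (e.g. word2vec or GloVe via gensim):

```python
import numpy as np

# Toy embedding table for illustration; in real use, load pretrained
# vectors, e.g. gensim's KeyedVectors over word2vec or GloVe.
word_vectors = {
    "cats":   np.array([0.9, 0.1, 0.0]),
    "dogs":   np.array([0.8, 0.2, 0.0]),
    "stocks": np.array([0.0, 0.1, 0.9]),
}

def centroid(tokens, dim=3):
    """Average the vectors of all in-vocabulary tokens."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc1 = "cats and dogs".split()
doc2 = "dogs chasing cats".split()
print(cosine(centroid(doc1), centroid(doc2)))  # close to 1.0
```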
Another option is the Jaccard index, which measures the similarity between two finite sets (here, the sets of tokens in each document); the Jaccard distance is simply 1 minus the Jaccard index. If we can represent the documents in a vector space, we can instead use cosine or Euclidean distance.
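For example, a set-based Jaccard distance over unique tokens can be computed in a few lines (simple whitespace tokenization is assumed here):

```python
def jaccard_similarity(doc_a: str, doc_b: str) -> float:
    """Jaccard index over the sets of unique tokens in each document."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    if not a and not b:
        return 1.0  # two empty documents are trivially identical
    return len(a & b) / len(a | b)

def jaccard_distance(doc_a: str, doc_b: str) -> float:
    return 1.0 - jaccard_similarity(doc_a, doc_b)

print(jaccard_distance("the cat sat", "the cat ran"))  # 0.5
```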
BERT is a sentence representation model. It is trained to predict masked words in a sentence and to decide whether two sentences follow each other in a document, i.e., strictly at the sentence level. Moreover, BERT requires memory quadratic in the input length, which would not be feasible for long documents.
It is quite common practice to average word embeddings to get a sentence representation. You can try the same trick with BERT: split the document into sentences and average the [CLS] vectors BERT produces for each sentence.
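A sketch of this idea using the Hugging Face transformers library (the library and model name are my assumptions; the answer does not prescribe a specific implementation):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def document_embedding(sentences):
    """Encode each sentence separately and average the [CLS] vectors."""
    cls_vectors = []
    with torch.no_grad():
        for sent in sentences:
            inputs = tokenizer(sent, return_tensors="pt",
                               truncation=True, max_length=512)
            outputs = model(**inputs)
            # [CLS] is the first token of the last hidden layer
            cls_vectors.append(outputs.last_hidden_state[0, 0])
    return torch.stack(cls_vectors).mean(dim=0)

doc1 = ["The cat sat on the mat.", "It purred happily."]
doc2 = ["A kitten rested on the rug.", "It seemed content."]
emb1, emb2 = document_embedding(doc1), document_embedding(doc2)
print(float(torch.nn.functional.cosine_similarity(emb1, emb2, dim=0)))
```

Note that raw [CLS] vectors from a vanilla, non-fine-tuned BERT are often mediocre similarity features; models fine-tuned for similarity (like those in the repositories linked in the question) usually work better.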
There are also document-level embeddings; for instance, doc2vec is a commonly used option.
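A short doc2vec sketch using gensim (the toy corpus is illustrative; meaningful vectors require a much larger training corpus):

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; real use needs many documents for meaningful vectors.
corpus = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stocks fell sharply on monday",
]
tagged = [TaggedDocument(words=doc.split(), tags=[i])
          for i, doc in enumerate(corpus)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# Infer vectors for unseen documents and compare them with cosine.
v1 = model.infer_vector("a cat sleeps on a mat".split())
v2 = model.infer_vector("markets dropped this week".split())
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```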
As far as I know, at the document level, frequency-based vectors such as tf-idf (with a good implementation in scikit-learn) are still close to the state of the art, so I would not hesitate to use them. At the very least, tf-idf is worth trying as a baseline to see how it compares to embeddings.
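For instance, with scikit-learn's TfidfVectorizer and cosine similarity (the example documents are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Document one: thousands of words about cats ...",
    "Document two: thousands of words about felines ...",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # sparse (n_docs, n_terms) matrix

# Pairwise cosine similarity between the two document vectors
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
```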