Spacy, Strange similarity between two sentences

Tags: python, nlp, spacy

I have downloaded the en_core_web_lg model and am trying to find the similarity between two sentences:

import spacy

nlp = spacy.load('en_core_web_lg')

search_doc = nlp("This was very strange argument between american and british person")

main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

print(main_doc.similarity(search_doc))

This returns a very strange value:

0.9066019751888448

These two sentences should not be 90% similar; they have very different meanings.

Why is this happening? Do I need to add some kind of additional vocabulary to make the similarity result more reasonable?


asked Aug 31 '18 10:08

Mr.D

2 Answers

spaCy builds the vector for a sentence by averaging the word vectors of its tokens. An ordinary sentence contains many words that carry little meaning (called stop words), and these dominate the average, so the similarity comes out misleadingly high. You can remove them like this:

search_doc = nlp("This was very strange argument between american and british person")
main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

search_doc_no_stop_words = nlp(' '.join([str(t) for t in search_doc if not t.is_stop]))
main_doc_no_stop_words = nlp(' '.join([str(t) for t in main_doc if not t.is_stop]))

print(search_doc_no_stop_words.similarity(main_doc_no_stop_words))

Or keep only the nouns of a given doc, since they carry the most information:

doc_nouns = nlp(' '.join([str(t) for t in doc if t.pos_ in ['NOUN', 'PROPN']]))


answered Sep 18 '22 17:09

Johannes Filter

The spaCy documentation for vector similarity explains the basic idea:
Each word has a vector representation learned from the contexts in which it appears (Word2Vec-style embeddings), trained on large corpora, as explained in the documentation.

Now, the vector of a full sentence is simply the average of the vectors of all its words. If many of those words lie in the same region of the vector space (for example filler words like "he", "was", "this", ...) and the remaining, more distinctive vocabulary "cancels out" in the average, you can end up with a high similarity like the one in your case.
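The averaging effect can be reproduced without spaCy. The sketch below uses made-up 3-dimensional vectors (not real en_core_web_lg values) in which the stop words share one region of the space; the names TOY_VECTORS, mean_vector and cosine are illustrative only.

```python
import math

# Made-up 3-d "word vectors" for illustration only; real spaCy vectors
# have many more dimensions and different values.
TOY_VECTORS = {
    "was":      [1.0, 0.1, 0.0],
    "the":      [0.9, 0.2, 0.1],
    "in":       [1.0, 0.0, 0.1],
    "argument": [0.0, 1.0, 0.0],
    "school":   [0.0, 0.0, 1.0],
}

def mean_vector(words):
    """Average the toy vectors of the given words, component by component."""
    vectors = [TOY_VECTORS[w] for w in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two "sentences" that share only stop words.
with_stop_words = cosine(
    mean_vector(["the", "argument", "was", "in"]),
    mean_vector(["was", "in", "the", "school"]),
)
# The same sentences reduced to their content words.
content_only = cosine(mean_vector(["argument"]), mean_vector(["school"]))

print(f"with stop words:    {with_stop_words:.3f}")
print(f"content words only: {content_only:.3f}")
```

In this toy setup the two content words are orthogonal, yet the averaged vectors that include the shared stop words still come out highly similar, which mirrors the behaviour in the question.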

The real question is what you can do about it. From my perspective, you could come up with a more complex similarity measure. Since you still have additional information about search_doc and main_doc, such as the original sentences, you could penalize the score by the difference in length, or alternatively compare shorter pieces of the sentences and compute pairwise similarities (then again, the question is which parts to compare).
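One possible reading of the pairwise idea is a greedy alignment: for each word vector of the shorter text, take its best cosine match in the other text, then average those maxima. This is only an assumption about what such a measure could look like, not something spaCy provides; best_match_similarity is a hypothetical helper and the vectors are toy values. With spaCy you could pass in something like [t.vector for t in doc if t.has_vector and not t.is_stop].

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_match_similarity(vectors_a, vectors_b):
    """Average, over the shorter list, of each vector's best cosine match
    in the longer list (hypothetical measure, for illustration)."""
    if len(vectors_a) == 0 or len(vectors_b) == 0:
        return 0.0
    shorter, longer = sorted((vectors_a, vectors_b), key=len)
    best_matches = [max(cosine(v, w) for w in longer) for v in shorter]
    return sum(best_matches) / len(best_matches)

# Toy word vectors: the first query word has a close match in the text,
# the second one has none.
query = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
text = [[0.0, 0.9, 0.1], [0.0, 0.0, 1.0], [0.1, 0.0, 1.0]]

score = best_match_similarity(query, text)
print(f"best-match similarity: {score:.3f}")
```

Because every word has to find its own counterpart, a single unmatched word pulls the score down instead of being averaged away. Which pieces to compare (tokens, noun chunks, clauses) remains an open design choice.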

For now, there is no clean way to simply resolve this issue, sadly.

answered Sep 17 '22 17:09

dennlinger
