Measure similarity between two documents using Doc2Vec

I have already trained a gensim Doc2Vec model, which I use to find the documents most similar to an unseen one.

Now I need to compute a similarity value between two unseen documents (they were not in the training data, so they cannot be referenced by doc id).

from gensim.models import doc2vec

# model_file is the path to the previously saved model
d2v_model = doc2vec.Doc2Vec.load(model_file)

string1 = 'this is some random paragraph'
string2 = 'this is another random paragraph'

vec1 = d2v_model.infer_vector(string1.split())
vec2 = d2v_model.infer_vector(string2.split())

In the code above, vec1 and vec2 are successfully set to vectors of length 'vector_size'.

Looking through the gensim API and examples, I could not find a method that works for me; all of them expect a TaggedDocument.

Can I compare the feature vectors value by value, and conclude that the closer they are, the more similar the texts are?

asked Nov 27 '18 15:11

Borislav Stoilov

1 Answer

Hello, just in case someone is interested: to do this you only need the cosine distance between the two vectors.

I found that most people use scipy's 'spatial' module for this purpose.

Here is a small code snippet that should work well if you already have a trained Doc2Vec model:

from gensim.models import doc2vec
from scipy import spatial

d2v_model = doc2vec.Doc2Vec.load(model_file)

first_text = '..'
second_text = '..'

# infer_vector expects a list of tokens, not a raw string
vec1 = d2v_model.infer_vector(first_text.split())
vec2 = d2v_model.infer_vector(second_text.split())

cos_distance = spatial.distance.cosine(vec1, vec2)
# cos_distance is 1 minus the cosine similarity, so it ranges from 0 to 2:
# 0 means the vectors point the same way; higher values mean more different texts
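
If you would rather have a similarity score (higher means more alike) than a distance, a minimal sketch along the following lines may help. It uses plain numpy instead of scipy. The helper names cosine_similarity and document_similarity are hypothetical, not part of gensim, and the only assumption about the model is that it has an infer_vector method that accepts a list of tokens.

```python
import numpy as np


def cosine_similarity(vec_a, vec_b):
    """Return the cosine of the angle between two 1-D vectors (range -1 to 1)."""
    a = np.asarray(vec_a, dtype=float)
    b = np.asarray(vec_b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / norm_product)


def document_similarity(model, text_a, text_b):
    """Infer one vector per text and compare them.

    `model` is assumed to be any object with an `infer_vector(tokens)`
    method, such as a trained gensim Doc2Vec model (hypothetical helper).
    """
    vec_a = model.infer_vector(text_a.split())
    vec_b = model.infer_vector(text_b.split())
    return cosine_similarity(vec_a, vec_b)


# Stand-in vectors for illustration only, not real Doc2Vec output
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

With a loaded model, the call would look like document_similarity(d2v_model, string1, string2). Keep in mind that infer_vector involves randomness, so repeated calls on the same text can give slightly different vectors and therefore slightly different scores.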

answered Oct 23 '22 06:10

Borislav Stoilov

Tags:

python

machine-learning

nlp

gensim

doc2vec
