
Measure similarity between two documents using Doc2Vec

I have already trained a gensim Doc2Vec model, which finds the most similar documents to an unknown one.

Now I need to find the similarity value between two unknown documents (which were not in the training data, so they cannot be referenced by doc id).

from gensim.models import doc2vec

d2v_model = doc2vec.Doc2Vec.load(model_file)

string1 = 'this is some random paragraph'
string2 = 'this is another random paragraph'

vec1 = d2v_model.infer_vector(string1.split())
vec2 = d2v_model.infer_vector(string2.split())

In the code above, vec1 and vec2 are successfully initialized to vectors of length 'vector_size'.

Looking through the gensim API and examples, I could not find a method that works for me; all of them expect a TaggedDocument.

Can I compare the feature vectors value by value, so that the closer they are, the more similar the texts are?

Borislav Stoilov asked Nov 27 '18 15:11







1 Answer

In case someone is interested: to do this you just need the cosine distance between the two vectors.
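
For reference, the cosine distance between two vectors is one minus the cosine of the angle between them, which is the definition SciPy's spatial.distance.cosine follows. The symbols u, v and d_cos are notation introduced here for illustration; they do not come from gensim or from the question:

```latex
d_{\cos}(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert_2 \, \lVert v \rVert_2}
```

The value is 0 when the vectors point in the same direction, 1 when they are orthogonal, and 2 when they point in opposite directions. Only the direction of the vectors matters, not their length.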

I found that most people use scipy's 'spatial' module for this purpose.

Here is a small code snippet that should work pretty well if you already have a trained doc2vec model:

from gensim.models import doc2vec
from scipy import spatial

d2v_model = doc2vec.Doc2Vec.load(model_file)

first_text = '..'
second_text = '..'

vec1 = d2v_model.infer_vector(first_text.split())
vec2 = d2v_model.infer_vector(second_text.split())

cos_distance = spatial.distance.cosine(vec1, vec2)
# cos_distance indicates how much the two texts differ from each other:
# 0 means the vectors point in the same direction, and higher values
# (up to 2) mean more distant (i.e. different) texts
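
Since the question asks for a similarity value rather than a distance, it may help to note that cosine similarity is simply 1 minus the cosine distance computed above. Below is a minimal, dependency-free sketch of the same calculation. The function name cosine_similarity and the short hand-written vectors are made up for illustration; they stand in for the output of infer_vector and are not part of gensim or SciPy.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors.

    1.0 means same direction, 0.0 means orthogonal, -1.0 means opposite.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for the vectors returned by infer_vector
vec1 = [1.0, 2.0, 3.0]
vec2 = [2.0, 4.0, 6.0]    # same direction as vec1, different length
vec3 = [3.0, 0.0, -1.0]   # orthogonal to vec1 (dot product is 0)

sim_same = cosine_similarity(vec1, vec2)        # close to 1.0
sim_orthogonal = cosine_similarity(vec1, vec3)  # 0.0

# The corresponding cosine distance is 1 - similarity
dist_same = 1.0 - sim_same
```

Because a NumPy array is iterable, the vectors returned by infer_vector can be passed to a function like this directly. Keep in mind that inference is randomized, so inferring the same text twice can give slightly different vectors and therefore slightly different similarity values.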
Borislav Stoilov answered Oct 23 '22 06:10
