I have already trained a gensim Doc2Vec model, which finds the documents most similar to an unknown one.
Now I need to find the similarity value between two unknown documents (which were not in the training data, so they cannot be referenced by doc id):
from gensim.models import doc2vec
d2v_model = doc2vec.Doc2Vec.load(model_file)
string1 = 'this is some random paragraph'
string2 = 'this is another random paragraph'
vec1 = d2v_model.infer_vector(string1.split())
vec2 = d2v_model.infer_vector(string2.split())
In the code above, vec1 and vec2 are successfully initialized to vectors of size 'vector_size'.
Looking through the gensim API and examples, I could not find a method that works for me; all of them expect a TaggedDocument.
Can I compare the feature vectors value by value, so that the closer they are, the more similar the texts?
The simplest way to compute the similarity between two documents using word embeddings is to compute the document centroid vector. This is the vector that's the average of all the word vectors in the document.
You can just add the word vectors of one sentence together. Then compute the cosine similarity of the two sentence vectors and use it as the similarity of the two sentences.
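The centroid approach described above can be sketched as follows. This is only an illustration under stated assumptions: `toy_vectors` is a made-up two-dimensional embedding table standing in for real word vectors, and `centroid` and `cosine_similarity` are hypothetical helper names, not gensim functions. Words without a vector are simply skipped.

```python
import math

def centroid(tokens, word_vectors):
    """Average the vectors of the tokens that have an embedding."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token has a known vector")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings; a real model would supply these.
toy_vectors = {
    "random": [1.0, 0.0],
    "paragraph": [0.0, 1.0],
    "another": [1.0, 0.0],
}

doc_a = centroid("this is some random paragraph".split(), toy_vectors)
doc_b = centroid("this is another random paragraph".split(), toy_vectors)
similarity = cosine_similarity(doc_a, doc_b)
```

Because cosine similarity ignores vector length, summing the word vectors and averaging them give the same similarity value.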
Hello, just in case someone is interested: to do this you just need the cosine distance between the two vectors.
I found that most people use scipy's 'spatial' module for this purpose.
Here is a small code snippet that should work well if you already have a trained doc2vec model:
from gensim.models import doc2vec
from scipy import spatial
d2v_model = doc2vec.Doc2Vec.load(model_file)
first_text = '..'
second_text = '..'
vec1 = d2v_model.infer_vector(first_text.split())
vec2 = d2v_model.infer_vector(second_text.split())
cos_distance = spatial.distance.cosine(vec1, vec2)
# cos_distance indicates how much the two texts differ from each other:
# higher values mean more distant (i.e. different) texts
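If you want a similarity score rather than a distance, note that scipy defines cosine distance as 1 minus cosine similarity, so the conversion is a subtraction. The sketch below re-implements that definition in plain Python to make the relationship explicit; `cosine_distance` is a hypothetical helper, not the scipy function itself, and the input vectors are made up for illustration.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, the definition scipy's cosine distance uses."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Made-up vectors standing in for the output of infer_vector.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])

# Convert a distance back into a similarity in the range [-1, 1].
similarity = 1.0 - orthogonal
```

The distance therefore ranges from 0 (same direction) through 1 (orthogonal) to 2 (opposite direction). Also keep in mind that inferred vectors can vary slightly between calls for the same text, so the score may not be exactly reproducible.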