
Applying word2vec to find all words above a similarity threshold

The command model.most_similar(positive=['france'], topn=100) gives the top 100 most similar words to "france". However, I would like to know whether there is a method that outputs all words whose similarity to a given word is above a threshold. Is there something like the following? model.most_similar(positive=['france'], threshold=0.9)

sss90 asked Mar 20 '18 18:03


1 Answer

No, you'd have to request a large number (or all, with topn=0; newer gensim releases document topn=None for this instead) and then apply the cutoff yourself.
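Applying the cutoff yourself can be sketched as below. This is a minimal illustration, not a gensim feature: the helper name `above_threshold` is hypothetical, and the stand-in `neighbors` list uses made-up scores in place of a real `most_similar` result, which is a list of (word, similarity) pairs.

```python
def above_threshold(neighbors, threshold):
    """Keep (word, similarity) pairs whose similarity is at least `threshold`."""
    return [(word, sim) for word, sim in neighbors if sim >= threshold]


# With a trained gensim model, the input would come from something like:
#   neighbors = model.most_similar(positive=['france'], topn=1000)
# The values below are made-up placeholders, not output from a real model.
neighbors = [("spain", 0.93), ("belgium", 0.91), ("italy", 0.88), ("paris", 0.75)]

close_words = above_threshold(neighbors, 0.9)
print(close_words)
```

Because most_similar returns its results sorted by descending similarity, you could also stop at the first pair below the threshold, but the plain filter is simpler and works for unsorted input too.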

What you request could theoretically be added as an option.

However, the absolute magnitudes of cosine similarities don't necessarily have a stable meaning, such as "90% similar", across different model runs. Their distribution can vary with model training parameters, such as the vector size, and they are most often interpreted only by ranked comparison with other pairwise values from the same model.

For example, the composition of the top-100 most-similar words for 'cold' may be very similar in models with different training parameters, but the range of absolute similarity values for the #1 to #100 words can be quite different. So if you were picking an absolute threshold, you'd likely want to vary the cutoff based on observing the model, or along with other model training metaparameters.
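To make that concrete, here is a small sketch with two stand-in neighbor lists for 'cold'. All scores are made-up placeholders, not real model output, and `relative_cutoff` is a hypothetical heuristic (keep words within a fraction of the top score) shown only as one possible way to vary the cutoff per model.

```python
def score_range(neighbors):
    """Return (highest, lowest) similarity in a list of (word, similarity) pairs."""
    sims = [sim for _, sim in neighbors]
    return max(sims), min(sims)


def relative_cutoff(neighbors, fraction):
    """Hypothetical heuristic: keep pairs scoring at least `fraction` of the top score."""
    top, _ = score_range(neighbors)
    return [(word, sim) for word, sim in neighbors if sim >= fraction * top]


# Made-up neighbors of 'cold' from two imaginary models: same ranking,
# different absolute similarity ranges.
model_a = [("chilly", 0.82), ("freezing", 0.78), ("frigid", 0.70), ("warm", 0.60)]
model_b = [("chilly", 0.61), ("freezing", 0.58), ("frigid", 0.52), ("warm", 0.45)]

# A fixed absolute threshold treats the two models very differently.
fixed_a = [word for word, sim in model_a if sim >= 0.7]
fixed_b = [word for word, sim in model_b if sim >= 0.7]

# A cutoff tied to each model's own score range selects the same words here.
relative_a = [word for word, _ in relative_cutoff(model_a, 0.85)]
relative_b = [word for word, _ in relative_cutoff(model_b, 0.85)]

print(score_range(model_a), score_range(model_b))
print(fixed_a, fixed_b)
print(relative_a, relative_b)
```

Inspecting `score_range` for your own model's top-N results is a quick way to see where a sensible cutoff might fall before committing to a number.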

gojomo answered Sep 20 '22 17:09