Applying word2vec to find all words above a similarity threshold

Question

The call model.most_similar(positive=['france'], topn=100) returns the 100 words most similar to "france". Is there a method that instead returns every word whose similarity to a given word is above a threshold, something like model.most_similar(positive=['france'], threshold=0.9)?

gojomo · Accepted Answer

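No. You'd have to request a large number of results (or all of them: topn=0 in older gensim releases, topn=None in recent ones, where the result is the raw array of similarities rather than a list of (word, score) pairs), then apply the cutoff yourself.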
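A minimal sketch of applying the cutoff yourself. The helper name above_threshold is hypothetical, not part of gensim, and the scores below are made up for illustration. It relies on most_similar() returning (word, score) pairs sorted from most to least similar.

```python
from itertools import takewhile


def above_threshold(neighbors, threshold):
    """Keep (word, score) pairs whose score is >= threshold.

    Assumes `neighbors` is sorted by descending score, as the list
    returned by most_similar() is, so the scan can stop at the first
    pair that falls below the threshold.
    """
    return list(takewhile(lambda pair: pair[1] >= threshold, neighbors))


# Made-up scores for illustration only; not output from a real model.
neighbors = [("spain", 0.93), ("belgium", 0.91), ("italy", 0.88), ("germany", 0.85)]
print(above_threshold(neighbors, 0.9))  # [('spain', 0.93), ('belgium', 0.91)]

# With a trained gensim model this might look like the following
# (topn=1000 is an arbitrary guess at "large enough"):
#   hits = above_threshold(
#       model.wv.most_similar(positive=['france'], topn=1000), 0.9)
```

If every returned pair passes the threshold, topn was probably too small to capture all matches, and the request should be repeated with a larger value.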

What you request could theoretically be added as an option.

However, the cosine-similarity absolute magnitudes don't necessarily have a stable meaning, like "90% similar" across different model runs. Their distribution can vary based on model training parameters, such as the vector size, and they are most-often interpreted only in ranked-comparison to other pairwise values from the same model.

For example, the composition of the top-100 most-similar words for 'cold' may be very similar in models with different training parameters, but the range of absolute similarity values for the #1 to #100 words can be quite different. So if you were picking an absolute threshold, you'd likely want to vary the cutoff based on observing the model, or along with other model training metaparameters.
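One way to let the cutoff follow the model rather than fixing it in advance is to define it relative to the best neighbor's score. This is only a sketch of that idea, not something gensim provides. The helper relative_cutoff, the fraction value, and the scores are all assumptions made for illustration.

```python
def relative_cutoff(neighbors, fraction=0.9):
    """Keep pairs scoring at least `fraction` of the best neighbor's score.

    Assumes `neighbors` is a list of (word, score) pairs with positive
    scores, such as the nearest neighbors from most_similar().
    """
    if not neighbors:
        return []
    floor = fraction * max(score for _, score in neighbors)
    return [(word, score) for word, score in neighbors if score >= floor]


# Made-up numbers: two hypothetical models that rank the neighbors of
# 'cold' identically but spread the similarity values over different ranges.
model_a = [("chilly", 0.90), ("freezing", 0.84), ("warm", 0.70)]
model_b = [("chilly", 0.60), ("freezing", 0.56), ("warm", 0.45)]

# A fixed threshold of 0.8 would keep two words for model_a and none for
# model_b; the relative cutoff keeps the same two words in both cases.
print([w for w, _ in relative_cutoff(model_a)])  # ['chilly', 'freezing']
print([w for w, _ in relative_cutoff(model_b)])  # ['chilly', 'freezing']
```

The fraction itself is still a tunable choice and would need to be checked against the model at hand.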

Tags: gensim, word2vec

sss90
