
Gensim Compute centroid from list of words

How can I compute the centroid of 5 given words from a word embedding, and then find the words most similar to that centroid, in gensim?

Sherlock asked Jan 24 '26 16:01



1 Answer

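You should check out the Word2Vec tutorial in the gensim documentation. The code below trains a small model on the sample corpus used in that tutorial, averages the vectors of the five words to get the centroid, and then looks up the words closest to it: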

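import numpy as np
import gensim.models
from gensim import utils
from gensim.test.utils import datapath


class MyCorpus:
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        # use a context manager so the file is closed once iteration ends
        with open(corpus_path) as corpus_file:
            for line in corpus_file:
                # assume there's one document per line, tokens separated by whitespace
                yield utils.simple_preprocess(line)


# train a small Word2Vec model on the sample corpus that ships with gensim
sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences)
word_vectors = model.wv

# the centroid is the element-wise mean of the five word vectors
words = ['king', 'man', 'walk', 'tennis', 'victorian']
centroid = np.mean([word_vectors[w] for w in words], axis=0)

# words whose vectors have the highest cosine similarity to the centroid
word_vectors.similar_by_vector(centroid)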

In this case, that gives:

[('man', 0.9996674060821533),
 ('by', 0.9995684623718262),
 ('over', 0.9995648264884949),
 ('from', 0.9995632171630859),
 ('were', 0.9995599389076233),
 ('who', 0.99954754114151),
 ('today', 0.9995439648628235),
 ('which', 0.999538004398346),
 ('on', 0.9995279312133789),
 ('being', 0.9995211958885193)]
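
To make the centroid lookup concrete without training a model, here is a minimal NumPy-only sketch. The toy vocabulary, the vector values, and the helper name most_similar_to_centroid are all made up for illustration and are not part of gensim; the sketch only assumes that "most similar" means highest cosine similarity to the mean vector.

```python
import numpy as np

# Toy 3-dimensional embeddings; words and values are hypothetical.
toy_vectors = {
    "king":   np.array([0.9, 0.1, 0.0]),
    "queen":  np.array([0.8, 0.2, 0.1]),
    "man":    np.array([0.7, 0.0, 0.2]),
    "tennis": np.array([0.0, 0.9, 0.1]),
    "walk":   np.array([0.1, 0.2, 0.9]),
}


def most_similar_to_centroid(words, vectors, topn=3, exclude_inputs=True):
    """Rank vocabulary words by cosine similarity to the mean vector of `words`."""
    # Centroid: element-wise mean of the selected word vectors.
    centroid = np.mean([vectors[w] for w in words], axis=0)
    centroid_unit = centroid / np.linalg.norm(centroid)

    scores = []
    for word, vec in vectors.items():
        if exclude_inputs and word in words:
            continue  # skip the words the centroid was built from
        cosine = float(np.dot(centroid_unit, vec / np.linalg.norm(vec)))
        scores.append((word, cosine))

    # Highest similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


result = most_similar_to_centroid(["king", "man"], toy_vectors)
print(result)
```

With exclude_inputs=False the input words themselves tend to rank at the top, which is why 'man' comes first in the output above: similar_by_vector only sees a vector and has no way to know which words it was built from. If you would rather have the input words filtered out, word_vectors.most_similar(positive=[...]) should do that for you, though as far as I know it averages length-normalized vectors, so the scores can differ slightly from the plain mean.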
louis_guitton answered Jan 27 '26 07:01
