
How can I access the output embedding (output vector) in gensim word2vec?

I want to use the output embeddings of word2vec, as in this paper (Improving Document Ranking with Dual Word Embeddings).

I know the input vectors are in syn0, and the output vectors are in syn1 (with hierarchical softmax) or syn1neg (with negative sampling).

But when I calculate most_similar with an output vector, I get nearly the same results across a range of indices, as if syn1 or syn1neg had been discarded.

Here is what I got.

IN[1]: from gensim.models import Word2Vec; model = Word2Vec.load('test_model.model')

IN[2]: model.most_similar([model.syn1neg[0]])

OUT[2]: [('of', -0.04402521997690201),
('has', -0.16387106478214264),
('in', -0.16650712490081787),
('is', -0.18117375671863556),
('by', -0.2527652978897095),
('was', -0.254993200302124),
('from', -0.2659570872783661),
('the', -0.26878535747528076),
('on', -0.27521973848342896),
('his', -0.2930959463119507)]

But a different syn1neg vector already gives almost the same output:

IN[3]: model.most_similar([model.syn1neg[50]])

OUT[3]: [('of', -0.07884830236434937),
('has', -0.16942456364631653),
('the', -0.1771494299173355),
('his', -0.2043554037809372),
('is', -0.23265135288238525),
('in', -0.24725285172462463),
('by', -0.27772971987724304),
('was', -0.2979024648666382),
('time', -0.3547973036766052),
('he', -0.36455872654914856)]

I want to get the output numpy arrays (negative-sampling or not) exactly as they were preserved during training.

Let me know how I can access the raw syn1 or syn1neg, whether through code or through some word2vec module that exposes the output embeddings.

asked Mar 02 '17 by Suin SEO




1 Answer

With negative-sampling, syn1neg weights are per-word, and in the same order as syn0.
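
Concretely, that means a word's OUT vector can be read by the same index as its IN vector. A minimal sketch, assuming a negative-sampling model (older gensim versions expose the vocab as model.vocab rather than model.wv.vocab):

idx = model.wv.vocab['cousin'].index   # same row index into syn0 and syn1neg
in_vec = model.wv.syn0[idx]            # IN (input) embedding
out_vec = model.syn1neg[idx]           # OUT (output) embedding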

The mere fact that your two examples give similar results doesn't necessarily indicate anything is wrong. The words are by default sorted by frequency, so the early words (including those in position 0 and 50) are very-frequent words with very-generic cooccurrence-based meanings (that may all be close to each other).
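
For instance, to see which words occupy those early positions (assuming the default frequency-sorted vocabulary):

print(model.wv.index2word[0], model.wv.index2word[50])  # both very frequent words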

Pick a medium-frequency word with a more distinct meaning, and you may get more meaningful results (if your corpus/settings/needs are sufficiently like those of the 'dual word embeddings' paper). For example, you might want to compare:

model.most_similar('cousin')

...with...

model.most_similar(positive=[model.syn1neg[model.vocab['cousin'].index]])

However, in all cases the existing most_similar() method only looks for similar-vectors in syn0 – the 'IN' vectors of the paper's terminology. So I believe the above code would only really be computing what the paper might call 'OUT-IN' similarity: a list of which IN vectors are most similar to a given OUT vector. They actually seem to tout the reverse, 'IN-OUT' similarity, as something useful. (That'd be the OUT vectors most similar to a given IN vector.)
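
To make that distinction concrete, here's an untested sketch of computing 'IN-OUT' similarity by hand: rank all OUT vectors (rows of syn1neg) by cosine similarity against one word's IN vector:

import numpy as np

def in_out_similar(model, word, topn=10):
    in_vec = model[word]   # the word's IN vector (a row of syn0)
    out = model.syn1neg    # all OUT vectors, row-aligned with syn0
    # cosine similarity of every OUT vector against the single IN vector
    sims = out.dot(in_vec) / (
        np.linalg.norm(out, axis=1) * np.linalg.norm(in_vec) + 1e-12)
    best = np.argsort(-sims)[:topn]
    return [(model.wv.index2word[i], float(sims[i])) for i in best]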

The latest versions of gensim introduce a KeyedVectors class for representing a set of word-vectors, keyed by string, separate from the specific Word2Vec model or other training method. You could potentially create an extra KeyedVectors instance that replaces the usual syn0 with syn1neg, to get lists of OUT vectors similar to a target vector (and thus calculate top-n 'IN-OUT' similarities or even 'OUT-OUT' similarities).

For example, this might work (I haven't tested it):

from gensim.models.keyedvectors import KeyedVectors  # import needed for this sketch

outv = KeyedVectors()
outv.vocab = model.wv.vocab  # same
outv.index2word = model.wv.index2word  # same
outv.syn0 = model.syn1neg  # different
inout_similars = outv.most_similar(positive=[model['cousin']])
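
The same instance should also give 'OUT-OUT' similarities, since then both the query vector and the searched vectors come from syn1neg (again untested):

outout_similars = outv.most_similar(positive=[outv['cousin']])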

syn1 only exists when using hierarchical softmax, and it's less clear what an "output embedding" for an individual word would be there. (There are multiple output nodes corresponding to predicting any one word, and they all need to be closer to their proper respective 0/1 values to predict a single word. So unlike with syn1neg, there's no one place to read a vector that means a single word's output. You might have to calculate/approximate some set of hidden->output weights that would drive those multiple output nodes to the right values.)
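
For illustration, here's an untested sketch of scoring one word's output under hierarchical softmax by walking that word's Huffman path; it assumes a model trained with hs=1, where each vocab entry exposes .point (the syn1 rows of its inner tree nodes) and .code (the 0/1 branch targets):

import numpy as np

def hs_output_prob(model, hidden_vec, word):
    # Probability the hierarchical-softmax tree assigns to `word` given a
    # hidden-layer vector: a product of per-node branch probabilities.
    v = model.wv.vocab[word]
    probs = 1.0 / (1.0 + np.exp(-np.dot(model.syn1[v.point], hidden_vec)))
    # code == 0 means the sigmoid itself is the correct-branch probability;
    # code == 1 means its complement is
    return float(np.prod(np.where(v.code, 1.0 - probs, probs)))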

answered Oct 22 '22 by gojomo