I want to use the output embeddings of word2vec, as in this paper (Improving Document Ranking with Dual Word Embeddings).
I know the input vectors are stored in syn0, and the output vectors are in syn1, or in syn1neg when negative sampling is used.
But when I calculate most_similar with an output vector, I get nearly the same results across a range of words, which makes me suspect syn1 or syn1neg has been removed or altered.
Here is what I got.
IN[1]: model = Word2Vec.load('test_model.model')
IN[2]: model.most_similar([model.syn1neg[0]])
OUT[2]: [('of', -0.04402521997690201),
('has', -0.16387106478214264),
('in', -0.16650712490081787),
('is', -0.18117375671863556),
('by', -0.2527652978897095),
('was', -0.254993200302124),
('from', -0.2659570872783661),
('the', -0.26878535747528076),
('on', -0.27521973848342896),
('his', -0.2930959463119507)]
But a different syn1neg vector gives almost the same output:
IN[3]: model.most_similar([model.syn1neg[50]])
OUT[3]: [('of', -0.07884830236434937),
('has', -0.16942456364631653),
('the', -0.1771494299173355),
('his', -0.2043554037809372),
('is', -0.23265135288238525),
('in', -0.24725285172462463),
('by', -0.27772971987724304),
('was', -0.2979024648666382),
('time', -0.3547973036766052),
('he', -0.36455872654914856)]
I want to get the output numpy arrays (negative-sampling or not) exactly as they were preserved during training.
How can I access the raw syn1 or syn1neg, or is there code or another word2vec module that exposes the output embeddings?
The output of the Word2vec neural net is a vocabulary in which each item has a vector attached to it, which can be fed into a deep-learning net or simply queried to detect relationships between words.
Word2Vec is a more recent model that embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word-vectors where vectors close together in vector space have similar meanings based on context, and word-vectors distant to each other have differing meanings.
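For example, a minimal sketch of training such a model with gensim and querying it (the toy corpus and settings here are purely illustrative, not from the question):

from gensim.models import Word2Vec

# Tiny illustrative corpus; a real corpus needs far more text
sentences = [['the', 'king', 'rules', 'the', 'kingdom'],
             ['the', 'queen', 'rules', 'the', 'kingdom']]
model = Word2Vec(sentences, size=100, window=5, min_count=1, sg=1, negative=5)
print(model.most_similar('queen'))  # nearest IN vectors, by cosine similarity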
To assess which word2vec model is best, calculate the distance between the vectors of each word pair in a fixed evaluation set (say, 200 pairs of words that should be close), sum the distances for each model, and the model with the smallest total distance is your best model.
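A minimal sketch of that comparison, assuming you already have such a list of word pairs that should be close (word_pairs and candidate_models below are hypothetical placeholders):

def total_pair_distance(model, word_pairs):
    # Sum of cosine distances (1 - similarity) over the evaluation pairs
    total = 0.0
    for w1, w2 in word_pairs:
        total += 1.0 - model.similarity(w1, w2)
    return total

# Per the heuristic above, the model with the smallest total distance wins
best_model = min(candidate_models, key=lambda m: total_pair_distance(m, word_pairs))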
With negative-sampling, the syn1neg weights are per-word, and in the same order as syn0.
The mere fact that your two examples give similar results doesn't necessarily indicate anything is wrong. The words are by default sorted by frequency, so the early words (including those in position 0 and 50) are very-frequent words with very-generic cooccurrence-based meanings (that may all be close to each other).
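You can check that frequency ordering directly; for example, on a pre-4.0 gensim model (untested sketch):

print(model.wv.index2word[:10])        # the most frequent, most generic words
w0, w50 = model.wv.index2word[0], model.wv.index2word[50]
print(w0, model.wv.vocab[w0].count)    # position 0: extremely high count
print(w50, model.wv.vocab[w50].count)  # position 50: still a very high count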
Pick a medium-frequency word with a more distinct meaning, and you may get more meaningful results (if your corpus/settings/needs are sufficiently like those of the 'dual word embeddings' paper). For example, you might want to compare:
model.most_similar('cousin')
...with...
model.most_similar(positive=[model.syn1neg[model.vocab['cousin'].index]])
However, in all cases the existing most_similar() method only looks for similar vectors in syn0 – the 'IN' vectors of the paper's terminology. So I believe the above code would only really be computing what the paper might call 'OUT-IN' similarity: a list of which IN vectors are most similar to a given OUT vector. They actually seem to tout the reverse, 'IN-OUT' similarity, as something useful. (That'd be the OUT vectors most similar to a given IN vector.)
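For illustration, one untested way to compute that 'IN-OUT' ranking directly with numpy for a negative-sampling model, without going through most_similar():

import numpy as np

in_vec = model.wv.syn0[model.wv.vocab['cousin'].index]  # the word's IN vector
out_mat = model.syn1neg                                  # all OUT vectors, same row order

# Cosine similarity of the IN vector against every OUT row
sims = out_mat.dot(in_vec) / (np.linalg.norm(out_mat, axis=1) * np.linalg.norm(in_vec))
top10 = np.argsort(-sims)[:10]
print([(model.wv.index2word[i], float(sims[i])) for i in top10])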
The latest versions of gensim introduce a KeyedVectors class for representing a set of word-vectors, keyed by string, separate from the specific Word2Vec model or other training method. You could potentially create an extra KeyedVectors instance that replaces the usual syn0 with syn1neg, to get lists of OUT vectors similar to a target vector (and thus calculate top-n 'IN-OUT' similarities or even 'OUT-OUT' similarities).
For example, this might work (I haven't tested it):
from gensim.models import KeyedVectors

outv = KeyedVectors()
outv.vocab = model.wv.vocab              # same vocabulary entries
outv.index2word = model.wv.index2word    # same word ordering
outv.syn0 = model.syn1neg                # different: use the OUT weights instead of syn0
inout_similars = outv.most_similar(positive=[model['cousin']])
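If that works, the same outv instance should also give 'OUT-OUT' neighbours, by supplying an OUT vector as the query (again untested):

outout_similars = outv.most_similar(positive=[outv['cousin']])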
syn1 only exists when using hierarchical softmax, and it's less clear what an "output embedding" for an individual word would be there. (There are multiple output nodes corresponding to predicting any one word, and they all need to be closer to their proper respective 0/1 values to predict a single word. So unlike with syn1neg, there's no one place to read a vector that means a single word's output. You might have to calculate/approximate some set of hidden->output weights that would drive those multiple output nodes to the right values.)
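Purely as a speculative sketch of that last idea (my own approximation, not something defined by gensim or the paper): one crude stand-in for a word's "output vector" under hierarchical softmax is a code-signed sum of the syn1 rows along its Huffman path, i.e. a hidden-layer direction that pushes each of the word's output nodes toward its 0/1 target:

import numpy as np

def approx_hs_out_vector(model, word):
    v = model.wv.vocab[word]
    # v.point: indices into syn1 for the internal nodes on this word's Huffman path
    # v.code: the 0/1 targets for those nodes; sign +1 for code 0, -1 for code 1
    signs = 1.0 - 2.0 * np.asarray(v.code, dtype=np.float32)
    return (signs[:, np.newaxis] * model.syn1[v.point]).sum(axis=0)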