I want to use the output embeddings of word2vec, as in this paper (Improving Document Ranking with Dual Word Embeddings).
I know the input vectors are stored in syn0, and the output vectors are in syn1, or in syn1neg when negative sampling is used.
But when I calculate most_similar with an output vector, I get nearly the same results across a range of words, which makes me suspect syn1 or syn1neg has been removed or altered.
Here is what I got.
IN[1]: model = Word2Vec.load('test_model.model')
IN[2]: model.most_similar([model.syn1neg[0]])
OUT[2]: [('of', -0.04402521997690201),
('has', -0.16387106478214264),
('in', -0.16650712490081787),
('is', -0.18117375671863556),
('by', -0.2527652978897095),
('was', -0.254993200302124),
('from', -0.2659570872783661),
('the', -0.26878535747528076),
('on', -0.27521973848342896),
('his', -0.2930959463119507)]
But a different syn1neg vector gives almost the same output:
IN[3]: model.most_similar([model.syn1neg[50]])
OUT[3]: [('of', -0.07884830236434937),
('has', -0.16942456364631653),
('the', -0.1771494299173355),
('his', -0.2043554037809372),
('is', -0.23265135288238525),
('in', -0.24725285172462463),
('by', -0.27772971987724304),
('was', -0.2979024648666382),
('time', -0.3547973036766052),
('he', -0.36455872654914856)]
I want to get the output numpy arrays (negative-sampling or not) exactly as they were preserved during training.
How can I access the raw syn1 or syn1neg, or is there code or another word2vec module that exposes the output embeddings?
The output of the Word2vec neural net is a vocabulary in which each item has a vector attached to it, which can be fed into a deep-learning net or simply queried to detect relationships between words.
Word2Vec is a more recent model that embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word-vectors where vectors close together in vector space have similar meanings based on context, and word-vectors distant to each other have differing meanings.
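For example, a minimal sketch of training such a model with gensim and querying it (the toy corpus and settings here are purely illustrative, not from the question):

from gensim.models import Word2Vec

# Tiny illustrative corpus; a real corpus needs far more text
sentences = [['the', 'king', 'rules', 'the', 'kingdom'],
             ['the', 'queen', 'rules', 'the', 'kingdom']]
model = Word2Vec(sentences, size=100, window=5, min_count=1, sg=1, negative=5)
print(model.most_similar('queen'))  # nearest IN vectors, by cosine similarity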
To assess which word2vec model is best, calculate the distance between the vectors of each word pair in a fixed evaluation set (say, 200 pairs of words that should be close), sum the distances for each model, and the model with the smallest total distance is your best model.
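A minimal sketch of that comparison, assuming you already have such a list of word pairs that should be close (word_pairs and candidate_models below are hypothetical placeholders):

def total_pair_distance(model, word_pairs):
    # Sum of cosine distances (1 - similarity) over the evaluation pairs
    total = 0.0
    for w1, w2 in word_pairs:
        total += 1.0 - model.similarity(w1, w2)
    return total

# Per the heuristic above, the model with the smallest total distance wins
best_model = min(candidate_models, key=lambda m: total_pair_distance(m, word_pairs))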
With negative-sampling, the syn1neg weights are per-word, and in the same order as syn0.
The mere fact that your two examples give similar results doesn't necessarily indicate anything is wrong. The words are by default sorted by frequency, so the early words (including those in position 0 and 50) are very-frequent words with very-generic cooccurrence-based meanings (that may all be close to each other).
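You can check that frequency ordering directly; for example, on a pre-4.0 gensim model (untested sketch):

print(model.wv.index2word[:10])        # the most frequent, most generic words
w0, w50 = model.wv.index2word[0], model.wv.index2word[50]
print(w0, model.wv.vocab[w0].count)    # position 0: extremely high count
print(w50, model.wv.vocab[w50].count)  # position 50: still a very high count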
Pick a medium-frequency word with a more distinct meaning, and you may get more meaningful results (if your corpus/settings/needs are sufficiently like those of the 'dual word embeddings' paper). For example, you might want to compare:
model.most_similar('cousin')
...with...
model.most_similar(positive=[model.syn1neg[model.vocab['cousin'].index]])
However, in all cases the existing most_similar() method only looks for similar vectors in syn0 – the 'IN' vectors of the paper's terminology. So I believe the above code would only really be computing what the paper might call 'OUT-IN' similarity: a list of which IN vectors are most similar to a given OUT vector. They actually seem to tout the reverse, 'IN-OUT' similarity, as something useful. (That'd be the OUT vectors most similar to a given IN vector.)
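For illustration, one untested way to compute that 'IN-OUT' ranking directly with numpy for a negative-sampling model, without going through most_similar():

import numpy as np

in_vec = model.wv.syn0[model.wv.vocab['cousin'].index]  # the word's IN vector
out_mat = model.syn1neg                                  # all OUT vectors, same row order

# Cosine similarity of the IN vector against every OUT row
sims = out_mat.dot(in_vec) / (np.linalg.norm(out_mat, axis=1) * np.linalg.norm(in_vec))
top10 = np.argsort(-sims)[:10]
print([(model.wv.index2word[i], float(sims[i])) for i in top10])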
The latest versions of gensim introduce a KeyedVectors class for representing a set of word-vectors, keyed by string, separate from the specific Word2Vec model or other training method. You could potentially create an extra KeyedVectors instance that replaces the usual syn0 with syn1neg, to get lists of OUT vectors similar to a target vector (and thus calculate top-n 'IN-OUT' similarities or even 'OUT-OUT' similarities).
For example, this might work (I haven't tested it):
from gensim.models import KeyedVectors

outv = KeyedVectors()
outv.vocab = model.wv.vocab              # same vocabulary entries
outv.index2word = model.wv.index2word    # same word ordering
outv.syn0 = model.syn1neg                # different: use the OUT weights instead of syn0
inout_similars = outv.most_similar(positive=[model['cousin']])
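If that works, the same outv instance should also give 'OUT-OUT' neighbours, by supplying an OUT vector as the query (again untested):

outout_similars = outv.most_similar(positive=[outv['cousin']])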
syn1 only exists when using hierarchical softmax, and it's less clear what an "output embedding" for an individual word would be there. (There are multiple output nodes corresponding to predicting any one word, and they all need to be closer to their proper respective 0/1 values to predict a single word. So unlike with syn1neg, there's no one place to read a vector that means a single word's output. You might have to calculate/approximate some set of hidden->output weights that would drive those multiple output nodes to the right values.)
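Purely as a speculative sketch of that last idea (my own approximation, not something defined by gensim or the paper): one crude stand-in for a word's "output vector" under hierarchical softmax is a code-signed sum of the syn1 rows along its Huffman path, i.e. a hidden-layer direction that pushes each of the word's output nodes toward its 0/1 target:

import numpy as np

def approx_hs_out_vector(model, word):
    v = model.wv.vocab[word]
    # v.point: indices into syn1 for the internal nodes on this word's Huffman path
    # v.code: the 0/1 targets for those nodes; sign +1 for code 0, -1 for code 1
    signs = 1.0 - 2.0 * np.asarray(v.code, dtype=np.float32)
    return (signs[:, np.newaxis] * model.syn1[v.point]).sum(axis=0)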