I am trying to understand why word2vec's skip-gram model has two representations for each word: the hidden representation (the word embedding) and the output representation (also called the context word embedding). Is this just for generality, so that the context can be anything (not just words), or is there a more fundamental reason?
CBOW (continuous bag of words) and skip-gram are the two main architectures associated with word2vec. Given an input word, skip-gram tries to predict the words in its context, whereas CBOW takes the surrounding context words and tries to predict the missing center word.
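To make the difference concrete, here is a small sketch of my own (a toy sentence and an assumed window size of 2) showing how training examples are built in each case:

```python
# Toy illustration (not from the original post) of skip-gram vs. CBOW training examples.
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 2

skipgram_pairs = []   # (input word, word to predict)
cbow_examples = []    # (context words, word to predict)

for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    # Skip-gram: the center word predicts each surrounding word separately.
    skipgram_pairs.extend((center, c) for c in context)
    # CBOW: the surrounding words jointly predict the missing center word.
    cbow_examples.append((context, center))

print(skipgram_pairs[:4])   # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...]
print(cbow_examples[0])     # (['quick', 'brown'], 'the')
```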
Perhaps the biggest challenge with word2vec is its inability to handle unknown or out-of-vocabulary (OOV) words: if the model has not encountered a word during training, it has no way to interpret it or to build a vector for it.
The Word2Vec technique was therefore conceived with two goals in mind: reduce the size of the word encoding space (the embedding space), and compress the most informative description of each word into its representation.
I recommend reading this article about Word2Vec: http://arxiv.org/pdf/1402.3722v1.pdf
They give an intuition about why there are two representations in a footnote: a word is not likely to appear in its own context, so you would want the model to assign a low probability p(w|w). But if you use the same vector for w as a context word as for w as a center word, you cannot make p(w|w) (computed via the dot product) small, because the dot product of a vector with itself is its squared norm, which is maximal if you keep the word embeddings on the unit sphere.
But this is just an intuition; I don't know whether there is a more rigorous justification for it...
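To see that effect numerically, here is a small sketch of my own (made-up random vectors, not trained embeddings): when word and context share the same vectors, the model gives a word a very high probability of appearing in its own context.

```python
# Rough numerical illustration (mine, not from the paper) of the footnote's argument.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 50
V = rng.normal(size=(vocab_size, dim))   # pretend these are the shared word/context vectors
w = 0                                    # pick some word

scores = V @ V[w]                        # dot product of word w with every candidate context
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# p(w|w) comes out (near-)maximal: V[w] . V[w] = ||V[w]||^2, while dot products with other
# random vectors of similar norm are much smaller -- yet in real text a word almost never
# occurs in its own context.
print(probs[w], probs.max())
```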
IMHO, the real reason you use different representations is that you are manipulating entities of a different nature. "dog" as a context is not to be treated the same as "dog" as a center word, because they are not the same thing. You are basically manipulating a big matrix of (word, context) co-occurrences, trying to maximize the probability of the pairs that actually occur. Theoretically you could use bigrams as contexts, trying to maximize, for instance, the probability of the pair (word="for", context="to maximize"), and you would then assign a vector representation to "to maximize". We don't do this because there would be far too many representations to compute and the co-occurrence matrix would be extremely sparse, but I think the idea is there: the fact that we use unigrams as contexts is just a particular case of all the kinds of context we could use.
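To make the "two kinds of entities" point concrete, here is a minimal sketch of my own (random untrained matrices, and a plain softmax instead of the negative-sampling or hierarchical-softmax tricks used in practice) of how a (word, context) pair is scored with separate matrices:

```python
# Minimal sketch (not the poster's code): W holds center-word vectors, C holds context
# vectors, and a (word, context) pair is scored by the dot product W[word] . C[context].
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 100
W = rng.normal(scale=0.01, size=(vocab_size, dim))   # center-word ("input") embeddings
C = rng.normal(scale=0.01, size=(vocab_size, dim))   # context ("output") embeddings

def pair_probability(word, context):
    """Softmax probability p(context | word) under the skip-gram model."""
    scores = C @ W[word]          # one score per candidate context
    scores -= scores.max()        # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[context] / exp_scores.sum()

# Training nudges W and C so that this value goes up for (word, context) pairs
# that actually occur in the corpus, and down for pairs that do not.
print(pair_probability(word=3, context=17))
```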
That's how I see it, and if it's wrong, please correct me!
Check the footnote on page 2 of this: http://arxiv.org/pdf/1402.3722v1.pdf
It gives a fairly clear intuition for the problem.
But you can also use only one vector to represent a word; see this Stanford CS 224n lecture: https://youtu.be/ERibwqs9p38?t=2064
I am not sure how that would be implemented (and the video does not explain it either).
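If I had to guess (this is my own sketch, not what the lecture shows), the simplest version would just reuse one embedding table for both the center-word and context roles:

```python
# Only a guess at a single-vector variant: one shared table E instead of separate W and C.
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(scale=0.01, size=(1000, 100))   # one embedding per word, used for both roles

def pair_probability(word, context):
    scores = E @ E[word]                       # same dot-product scoring, single table
    scores -= scores.max()
    exp_scores = np.exp(scores)
    return exp_scores[context] / exp_scores.sum()

print(pair_probability(word=3, context=17))
```

That presumably brings back the p(w|w) issue described in the footnote above, since E[word] . E[word] is the word's squared norm.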