I am trying to understand why word2vec's skip-gram model has two representations for each word: the hidden representation (the word embedding) and the output representation (also called the context word embedding). Is this just for generality, so that the context can be anything (not just words), or is there a more fundamental reason?
CBOW (continuous bag of words) and skip-gram are the two main architectures associated with word2vec. Given an input word, skip-gram tries to predict the words in its context, whereas CBOW takes the surrounding context words and tries to predict the missing center word.
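To make the difference concrete, here is a small sketch of my own (a toy sentence and an assumed window size of 2) showing how training examples are built in each case:

```python
# Toy illustration (not from the original post) of skip-gram vs. CBOW training examples.
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 2

skipgram_pairs = []   # (input word, word to predict)
cbow_examples = []    # (context words, word to predict)

for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    # Skip-gram: the center word predicts each surrounding word separately.
    skipgram_pairs.extend((center, c) for c in context)
    # CBOW: the surrounding words jointly predict the missing center word.
    cbow_examples.append((context, center))

print(skipgram_pairs[:4])   # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...]
print(cbow_examples[0])     # (['quick', 'brown'], 'the')
```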
Perhaps the biggest challenge with word2vec is its inability to handle unknown or out-of-vocabulary (OOV) words: if the model has not encountered a word during training, it has no way to interpret it or to build a vector for it.
The Word2Vec technique was therefore conceived with two goals in mind: reduce the size of the word encoding space (the embedding space), and compress the most informative description of each word into its representation.
I recommend reading this article about Word2Vec: http://arxiv.org/pdf/1402.3722v1.pdf
They give an intuition about why there are two representations in a footnote: a word is not likely to appear in its own context, so you would want the model to assign a low probability p(w|w). But if you use the same vector for w as a context word as for w as a center word, you cannot make p(w|w) (computed via the dot product) small, because the dot product of a vector with itself is its squared norm, which is maximal if you keep the word embeddings on the unit sphere.
But this is just an intuition; I don't know whether there is a more rigorous justification for it...
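To see that effect numerically, here is a small sketch of my own (made-up random vectors, not trained embeddings): when word and context share the same vectors, the model gives a word a very high probability of appearing in its own context.

```python
# Rough numerical illustration (mine, not from the paper) of the footnote's argument.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 50
V = rng.normal(size=(vocab_size, dim))   # pretend these are the shared word/context vectors
w = 0                                    # pick some word

scores = V @ V[w]                        # dot product of word w with every candidate context
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# p(w|w) comes out (near-)maximal: V[w] . V[w] = ||V[w]||^2, while dot products with other
# random vectors of similar norm are much smaller -- yet in real text a word almost never
# occurs in its own context.
print(probs[w], probs.max())
```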
IMHO, the real reason you use different representations is that you are manipulating entities of a different nature. "dog" as a context is not to be treated the same as "dog" as a center word, because they are not the same thing. You are basically manipulating a big matrix of (word, context) co-occurrences, trying to maximize the probability of the pairs that actually occur. Theoretically you could use bigrams as contexts, trying to maximize, for instance, the probability of the pair (word="for", context="to maximize"), and you would then assign a vector representation to "to maximize". We don't do this because there would be far too many representations to compute and the co-occurrence matrix would be extremely sparse, but I think the idea is there: the fact that we use unigrams as contexts is just a particular case of all the kinds of context we could use.
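To make the "two kinds of entities" point concrete, here is a minimal sketch of my own (random untrained matrices, and a plain softmax instead of the negative-sampling or hierarchical-softmax tricks used in practice) of how a (word, context) pair is scored with separate matrices:

```python
# Minimal sketch (not the poster's code): W holds center-word vectors, C holds context
# vectors, and a (word, context) pair is scored by the dot product W[word] . C[context].
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 100
W = rng.normal(scale=0.01, size=(vocab_size, dim))   # center-word ("input") embeddings
C = rng.normal(scale=0.01, size=(vocab_size, dim))   # context ("output") embeddings

def pair_probability(word, context):
    """Softmax probability p(context | word) under the skip-gram model."""
    scores = C @ W[word]          # one score per candidate context
    scores -= scores.max()        # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[context] / exp_scores.sum()

# Training nudges W and C so that this value goes up for (word, context) pairs
# that actually occur in the corpus, and down for pairs that do not.
print(pair_probability(word=3, context=17))
```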
That's how I see it, and if it's wrong, please correct me!
Check the footnote on page 2 of this: http://arxiv.org/pdf/1402.3722v1.pdf
It gives a fairly clear intuition for the problem.
But you can also use only one vector to represent a word; see this Stanford CS 224n lecture: https://youtu.be/ERibwqs9p38?t=2064
I am not sure how that would be implemented (and the video does not explain it either).
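If I had to guess (this is my own sketch, not what the lecture shows), the simplest version would just reuse one embedding table for both the center-word and context roles:

```python
# Only a guess at a single-vector variant: one shared table E instead of separate W and C.
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(scale=0.01, size=(1000, 100))   # one embedding per word, used for both roles

def pair_probability(word, context):
    scores = E @ E[word]                       # same dot-product scoring, single table
    scores -= scores.max()
    exp_scores = np.exp(scores)
    return exp_scores[context] / exp_scores.sum()

print(pair_probability(word=3, context=17))
```

That presumably brings back the p(w|w) issue described in the footnote above, since E[word] . E[word] is the word's squared norm.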