
Why does word2vec use 2 representations for each word?

Tags:

word2vec

I am trying to understand why word2vec's skip-gram model has two representations for each word: the hidden representation (the word embedding) and the output representation (also called the context word embedding). Is this just for generality, so that the context can be anything (not just words), or is there a more fundamental reason?

asked Apr 01 '15 by vkmv


People also ask

What are the 2 architectures of Word2Vec?

CBOW (continuous bag of words) and the skip-gram model are the two main architectures associated with word2vec. Given an input word, skip-gram tries to predict the words in its context, whereas the CBOW model takes the surrounding context words and tries to predict the missing center word.

What are the challenges of representing Word2Vec of a text?

Perhaps the biggest problem with word2vec is its inability to handle unknown or out-of-vocabulary (OOV) words: if the model has not seen a word before, it has no way to interpret it or build a vector for it.

What are the two important improvements by representing a term in Word2Vec rather than one hot?

The word2vec technique was conceived with two goals in mind: reduce the size of the word encoding space (the embedding space), and compress the most informative description of each word into its representation.


2 Answers

I recommend reading this article about word2vec: http://arxiv.org/pdf/1402.3722v1.pdf

They give an intuition for the two representations in a footnote: it is not likely that a word appears in its own context, so you would want to minimize the probability p(w|w). But if you use the same vector for w as a context word as for w as a center word, you cannot minimize p(w|w) (computed via the dot product) while keeping the word embeddings normalized (on the unit sphere).

But that is just an intuition; I don't know whether there is a more rigorous justification.
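To make that dot-product point concrete, here is a tiny numeric sketch (numpy, with made-up 2-d vectors; not from the paper):

```python
import numpy as np

# "dog" as a center word (made-up 2-d vector, for illustration only)
v_dog = np.array([0.6, 0.8])

# With one shared representation, score(dog, dog) = v . v = ||v||^2,
# so p(dog | dog) can only be pushed down by shrinking the vector itself.
print(v_dog @ v_dog)   # 1.0

# With a separate context representation, the context vector can point
# away from the center vector: the dot product (and hence p(dog | dog))
# drops while both vectors keep their norm.
c_dog = np.array([0.8, -0.6])
print(v_dog @ c_dog)   # 0.0
```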

IMHO, the real reason you use different representations is that you are manipulating entities of a different nature. "dog" as a context is not the same thing as "dog" as a center word. You are basically manipulating a big matrix of (word, context) co-occurrences, trying to maximize the probability of the pairs that actually occur. In theory you could use bigrams as contexts, trying to maximize, for instance, the probability of (word="for", context="to maximize"), and you would then assign a vector representation to "to maximize". We don't do this because there would be far too many representations to compute and the matrix would be extremely sparse, but I think that is the idea: using unigrams as contexts is just one particular choice among all the kinds of contexts we could use.
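As a rough illustration of this two-table setup, here is a minimal skip-gram scoring sketch (numpy; the toy vocabulary, dimensions and random initialisation are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["dog", "barks", "for", "to", "maximize"]   # toy vocabulary
V, d = len(vocab), 50
idx = {w: i for i, w in enumerate(vocab)}

W_center  = rng.normal(scale=0.1, size=(V, d))   # "input" / word embeddings
W_context = rng.normal(scale=0.1, size=(V, d))   # "output" / context embeddings

def p_context_given_center(center, context):
    """Softmax over all context vectors of score = context . center."""
    scores = W_context @ W_center[idx[center]]   # one dot product per vocabulary word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs[idx[context]]

# Training would push up the probability of observed (word, context) pairs,
# e.g. p("barks" | "dog"), by adjusting both tables.
print(p_context_given_center("dog", "barks"))
```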

That's how I see it; if it's wrong, please correct me!

answered Oct 22 '22 by HediBY


Check the footnote on page 2 of this: http://arxiv.org/pdf/1402.3722v1.pdf

This gives a quite clear intuition for the problem.

But you can also use only one vector to represent each word. See this Stanford CS 224n lecture: https://youtu.be/ERibwqs9p38?t=2064

I am not sure how that would be implemented (the video doesn't explain it either).
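One way it could be done (just a sketch of my own, not taken from the lecture) is to tie the two tables, i.e. reuse the same matrix for both the center and the context role:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 50                              # toy vocabulary size and dimension
W = rng.normal(scale=0.1, size=(V, d))    # single shared embedding table

def p_context_given_center(center_id, context_id):
    # The same vectors serve as both center and context representations.
    scores = W @ W[center_id]             # scores[center_id] = ||W[center_id]||^2
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs[context_id]

# Note the self-score is the squared norm of the word's own vector, which is
# exactly the issue the footnote above raises for the single-vector variant.
print(p_context_given_center(0, 1))
```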

answered Oct 22 '22 by dust