CBOW vs. skip-gram: why invert context and target words?

On this page, it is stated that:

[...] skip-gram inverts contexts and targets, and tries to predict each context word from its target word [...]

However, looking at the training dataset it produces, the contents of the X and Y pairs seem to be interchangeable, as in these two (X, Y) pairs:

(quick, brown), (brown, quick)

So why distinguish so sharply between contexts and targets if it amounts to the same thing in the end?
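For concreteness, here is a minimal sketch of how such pairs can be generated. The helper name `skipgram_pairs`, the four-word sentence, and the window size of 1 are my own assumptions for illustration, not the tutorial's actual code.

```python
def skipgram_pairs(tokens, window=1):
    """Return (input, label) pairs: each word paired with every neighbour within `window`."""
    pairs = []
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never paired with itself
                pairs.append((word, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens)
print(pairs)

# With a symmetric window, every pair also shows up reversed,
# which is the interchangeability described above.
print(all((y, x) in pairs for x, y in pairs))
```

With a symmetric window, the set of pairs is the same in both directions, which is exactly why the distinction looks redundant at this level.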

Also, while doing the word2vec exercise in Udacity's Deep Learning course, I wonder why they draw such a strong distinction between the two approaches in this problem:

An alternative to skip-gram is another Word2Vec model called CBOW (Continuous Bag of Words). In the CBOW model, instead of predicting a context word from a word vector, you predict a word from the sum of all the word vectors in its context. Implement and evaluate a CBOW model trained on the text8 dataset.

Wouldn't this yield the same results?

237

asked Jul 10 '16 01:07

Guillaume Chevalier

2 Answers

Here is my oversimplified and rather naive understanding of the difference:

As we know, CBOW learns to predict a word from its context, that is, to maximize the probability of the target word given the context. This turns out to be a problem for rare words. For example, given the context "yesterday was a really [...] day", a CBOW model will tell you that the word is most probably "beautiful" or "nice". Words like "delightful" will get much less of the model's attention, because it is designed to predict the most probable word. The rare word gets smoothed over by the many examples that contain more frequent words.

On the other hand, the skip-gram model is designed to predict the context. Given the word "delightful", it must understand it and tell us that there is a high probability that the context is "yesterday was a really [...] day", or some other relevant context. With skip-gram, the word "delightful" does not have to compete with the word "beautiful"; instead, delightful+context pairs are treated as new observations.
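A toy, count-based sketch of that argument follows. The counts are made up for illustration, and empirical frequencies stand in here for what a trained softmax would roughly approximate; this is not how word2vec is actually trained.

```python
from collections import Counter

# Made-up counts of the word filling "yesterday was a really [...] day".
fills = Counter({"beautiful": 8, "nice": 6, "delightful": 1})
context = ["yesterday", "was", "a", "really", "day"]

# CBOW view: one distribution over target words per context,
# so every candidate competes for the same probability mass.
total = sum(fills.values())
p_target_given_context = {w: n / total for w, n in fills.items()}

# Skip-gram view: each input word gets its own distribution over context
# words, normalised only over that word's own (word, context) pairs.
p_context_given_word = {}
for word, n in fills.items():
    pair_counts = Counter({c: n for c in context})  # n pairs per context word
    pairs_total = sum(pair_counts.values())
    p_context_given_word[word] = {c: k / pairs_total for c, k in pair_counts.items()}

print(p_target_given_context["delightful"])       # 1/15: crowded out by frequent words
print(p_context_given_word["delightful"]["day"])  # same value as for "beautiful"
print(p_context_given_word["beautiful"]["day"])
```

Note that the rare word still contributes far fewer training pairs (5 versus 40 in this toy setup), so the sketch only illustrates the normalisation side of the argument, not the training dynamics.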

UPDATE

Thanks to @0xF for sharing this article

According to Mikolov

Skip-gram: works well with small amount of the training data, represents well even rare words or phrases.

CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words

One more addition to the subject is found here:

In the "skip-gram" mode alternative to "CBOW", rather than averaging the context words, each is used as a pairwise training example. That is, in place of one CBOW example such as [predict 'ate' from average('The', 'cat', 'the', 'mouse')], the network is presented with four skip-gram examples [predict 'ate' from 'The'], [predict 'ate' from 'cat'], [predict 'ate' from 'the'], [predict 'ate' from 'mouse']. (The same random window-reduction occurs, so half the time that would just be two examples, of the nearest words.)
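The grouping described in that quote can be sketched in a few lines. The function names `cbow_example` and `pairwise_examples` are hypothetical, and the direction of prediction (context word to 'ate') follows the quote above rather than the "predict context from target" wording of the tutorial.

```python
def cbow_example(tokens, i, window):
    """One CBOW example: (all context words within `window`, target word at position i)."""
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    return context, tokens[i]

def pairwise_examples(tokens, i, window):
    """Split the same context into one (context word, target) example per context word."""
    context, target = cbow_example(tokens, i, window)
    return [(c, target) for c in context]

tokens = ["The", "cat", "ate", "the", "mouse"]

print(cbow_example(tokens, 2, window=2))       # one example with four context words
print(pairwise_examples(tokens, 2, window=2))  # four separate examples
print(pairwise_examples(tokens, 2, window=1))  # reduced window: only the nearest words
```

The information in the two forms is the same; what changes is whether the context words are combined into a single input or presented one at a time.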

154

answered Oct 15 '22 19:10

Serhiy

It has to do with what exactly you're calculating at any given point. The difference will become clearer if you start to look at models that incorporate a larger context for each probability calculation.

In skip-gram, you're calculating the context word(s) from the word at the current position in the sentence; you're "skipping" the current word (and potentially a bit of the context) in your calculation. The result can be more than one word (but not if your context window is just one word long).

In CBOW, you're calculating the current word from the context word(s), so you will only ever have one word as a result.
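A rough sketch of the two calculations, in plain Python, may make the "one result versus several" point concrete. Everything here is an assumption for illustration: the tiny vocabulary, the random weights, the averaging of context vectors (the exercise quoted in the question sums them instead), and the full softmax, which real word2vec implementations typically replace with negative sampling or hierarchical softmax.

```python
import math
import random

random.seed(0)
vocab = ["the", "quick", "brown", "fox", "jumped"]
dim = 3
# Hypothetical, randomly initialised input (embedding) and output weight tables.
emb = {w: [random.uniform(-1, 1) for _ in range(dim)] for w in vocab}
out = {w: [random.uniform(-1, 1) for _ in range(dim)] for w in vocab}

def softmax_over_vocab(hidden):
    """Score every vocabulary word against `hidden` and normalise to probabilities."""
    scores = {w: sum(h * o for h, o in zip(hidden, out[w])) for w in vocab}
    z = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / z for w, s in scores.items()}

def cbow_loss(context, target):
    """One prediction: average the context vectors, then predict the single target."""
    hidden = [sum(emb[w][k] for w in context) / len(context) for k in range(dim)]
    return -math.log(softmax_over_vocab(hidden)[target])

def skipgram_losses(center, context):
    """One prediction per context word, all from the same center-word vector."""
    probs = softmax_over_vocab(emb[center])
    return [-math.log(probs[w]) for w in context]

context, center = ["quick", "fox"], "brown"
print(cbow_loss(context, center))        # a single loss value
print(skipgram_losses(center, context))  # one loss per context word
```

In the CBOW function the context words are merged before the prediction, so their individual identities are blurred; in the skip-gram function each context word produces its own error signal for the center word's vector.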

answered Oct 15 '22 17:10

Clay

Tags: tensorflow, deep-learning, nlp, word-embedding, word2vec
