I am trying to train a word2vec model on very short phrases (5 grams). Since each sentence or example is very short, I believe the window size I can use can atmost be 2. I am trying to understand what the implications of such a small window size are on the quality of the learned model, so that I can understand whether my model has learnt something meaningful or not. I tried training a word2vec model on 5-grams but it appears the learnt model does not capture semantics etc very well.
I am using the following test to evaluate the accuracy of model: https://code.google.com/p/word2vec/source/browse/trunk/questions-words.txt
I used gensim.Word2Vec to train a model and here is a snippet of my accuracy scores (using a window size of 2)
[{'correct': 2, 'incorrect': 304, 'section': 'capital-common-countries'},
{'correct': 2, 'incorrect': 453, 'section': 'capital-world'},
{'correct': 0, 'incorrect': 86, 'section': 'currency'},
{'correct': 2, 'incorrect': 703, 'section': 'city-in-state'},
{'correct': 123, 'incorrect': 183, 'section': 'family'},
{'correct': 21, 'incorrect': 791, 'section': 'gram1-adjective-to-adverb'},
{'correct': 8, 'incorrect': 544, 'section': 'gram2-opposite'},
{'correct': 284, 'incorrect': 976, 'section': 'gram3-comparative'},
{'correct': 67, 'incorrect': 863, 'section': 'gram4-superlative'},
{'correct': 41, 'incorrect': 951, 'section': 'gram5-present-participle'},
{'correct': 6, 'incorrect': 1089, 'section': 'gram6-nationality-adjective'},
{'correct': 171, 'incorrect': 1389, 'section': 'gram7-past-tense'},
{'correct': 56, 'incorrect': 936, 'section': 'gram8-plural'},
{'correct': 52, 'incorrect': 705, 'section': 'gram9-plural-verbs'},
{'correct': 835, 'incorrect': 9973, 'section': 'total'}]
I also tried running the demo-word-accuracy.sh script outlined here with a window size of 2 and get poor accuracy as well:
Sample output:
capital-common-countries:
ACCURACY TOP1: 19.37 % (98 / 506)
Total accuracy: 19.37 % Semantic accuracy: 19.37 % Syntactic accuracy: -nan %
capital-world:
ACCURACY TOP1: 10.26 % (149 / 1452)
Total accuracy: 12.61 % Semantic accuracy: 12.61 % Syntactic accuracy: -nan %
currency:
ACCURACY TOP1: 6.34 % (17 / 268)
Total accuracy: 11.86 % Semantic accuracy: 11.86 % Syntactic accuracy: -nan %
city-in-state:
ACCURACY TOP1: 11.78 % (185 / 1571)
Total accuracy: 11.83 % Semantic accuracy: 11.83 % Syntactic accuracy: -nan %
family:
ACCURACY TOP1: 57.19 % (175 / 306)
Total accuracy: 15.21 % Semantic accuracy: 15.21 % Syntactic accuracy: -nan %
gram1-adjective-to-adverb:
ACCURACY TOP1: 6.48 % (49 / 756)
Total accuracy: 13.85 % Semantic accuracy: 15.21 % Syntactic accuracy: 6.48 %
gram2-opposite:
ACCURACY TOP1: 17.97 % (55 / 306)
Total accuracy: 14.09 % Semantic accuracy: 15.21 % Syntactic accuracy: 9.79 %
gram3-comparative:
ACCURACY TOP1: 34.68 % (437 / 1260)
Total accuracy: 18.13 % Semantic accuracy: 15.21 % Syntactic accuracy: 23.30 %
gram4-superlative:
ACCURACY TOP1: 14.82 % (75 / 506)
Total accuracy: 17.89 % Semantic accuracy: 15.21 % Syntactic accuracy: 21.78 %
gram5-present-participle:
ACCURACY TOP1: 19.96 % (198 / 992)
Total accuracy: 18.15 % Semantic accuracy: 15.21 % Syntactic accuracy: 21.31 %
gram6-nationality-adjective:
ACCURACY TOP1: 35.81 % (491 / 1371)
Total accuracy: 20.76 % Semantic accuracy: 15.21 % Syntactic accuracy: 25.14 %
gram7-past-tense:
ACCURACY TOP1: 19.67 % (262 / 1332)
Total accuracy: 20.62 % Semantic accuracy: 15.21 % Syntactic accuracy: 24.02 %
gram8-plural:
ACCURACY TOP1: 35.38 % (351 / 992)
Total accuracy: 21.88 % Semantic accuracy: 15.21 % Syntactic accuracy: 25.52 %
gram9-plural-verbs:
ACCURACY TOP1: 20.00 % (130 / 650)
Total accuracy: 21.78 % Semantic accuracy: 15.21 % Syntactic accuracy: 25.08 %
Questions seen / total: 12268 19544 62.77 %
However the word2vec site claims its possible to obtain an accuracy of ~60% on these tasks. Hence I would like to gain some insights into the effect of these hyperparameters like window size and how they affect quality of learnt models.
The Gensim default window size is 5 (two words before and two words after the input word, in addition to the input word itself). The number of negative samples is another factor of the training process.
The standard Word2Vec pre-trained vectors, as mentioned above, have 300 dimensions. We have tended to use 200 or fewer, under the rationale that our corpus and vocabulary are much smaller than those of Google News, and so we need fewer dimensions to represent them.
Original Word2Vec cannot control the number of iterations, since it scans through the corpus and optimizes the cosine distance between word–context pairs. The training time (iteration number) is thus proportional to the size of the corpus. This makes the algorithm hard to train on big corpus.
To assess which word2vec model is best, simply calculate the distance for each pair, do it 200 times, sum up the total distance, and the smallest total distance will be your best model.
Very low scores on the analogy-questions are more likely due to limitations in the amount or quality of your training data, rather than mistuned parameters. (If your training phrases are really only 5 words each, they may not capture the same rich relations as can be discovered from datasets with full sentences.)
You could use a window of 5 on your phrases – the training code trims the window to what's available on either side – but then every word of each phrase affects all of the other words. That might be OK: one of the Google word2vec papers ("Distributed Representations of Words and Phrases and their Compositionality", https://arxiv.org/abs/1310.4546) mentions that to get the best accuracy on one of their phrase tasks, they used "the entire sentence for the context". (On the other hand, on one English corpus of short messages, I found a window size of just 2 created the vectors that scored best on the analogies-evaluation, so larger isn't necessarily better.)
A paper by Levy & Goldberg, "Dependency-Based Word Embeddings", speaks a bit about the qualitative effect of window-size:
https://levyomer.files.wordpress.com/2014/04/dependency-based-word-embeddings-acl-2014.pdf
They find:
Larger windows tend to capture more topic/domain information: what other words (of any type) are used in related discussions? Smaller windows tend to capture more about word itself: what other words are functionally similar? (Their own extension, the dependency-based embeddings, seems best at finding most-similar words, synonyms or obvious-alternatives that could drop-in as replacements of the origin word.)
To your question: "I am trying to understand what the implications of such a small window size are on the quality of the learned model".
For example "stackoverflow great website for programmers" with 5 words (suppose we save the stop words great and for here) if the window size is 2 then the vector of word "stackoverflow" is directly affected by the word "great" and "website", if the window size is 5 "stackoverflow" can be directly affected by two more words "for" and "programmers". The 'affected' here means it will pull the vector of two words closer.
So it depends on the material you are using for training, if the window size of 2 can capture the context of a word, but 5 is chosen, it will decrease the quality of the learnt model, and vise versa.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With