I have been struggling to understand the use of the size parameter in gensim.models.Word2Vec. From the Gensim documentation, size is the dimensionality of the vector. Now, as far as I understand, word2vec creates, for each word, a vector of the probability of closeness with the other words in the sentence. So if my vocab size is 30, how can it create a vector with a dimension greater than 30? Can anyone please brief me on the optimal value for the Word2Vec size?
Thank you.
We can train the gensim word2vec model with our own custom corpus as follows:

    >>> model = Word2Vec(sent, min_count=1, size=50, workers=3, window=3, sg=1)

Let's try to understand the hyperparameters of this model. size: the number of dimensions of the embeddings; the default is 100.
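For instance, a minimal runnable sketch (the toy corpus here is invented for illustration; note that in gensim 4.x the parameter was renamed from size to vector_size):

    # Minimal sketch: train Word2Vec on a tiny illustrative corpus (gensim 4.x API).
    from gensim.models import Word2Vec

    sent = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "chased", "the", "cat"],
    ]

    # vector_size=50 (size=50 in older gensim) maps each word to a 50-dimensional dense vector
    model = Word2Vec(sent, min_count=1, vector_size=50, workers=3, window=3, sg=1)

    print(model.wv["cat"].shape)  # (50,) -- one 50-dimensional vector per word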
The standard pre-trained Word2Vec vectors (the Google News set) have 300 dimensions. We have tended to use 200 or fewer, on the rationale that our corpus and vocabulary are much smaller than Google News's, so we need fewer dimensions to represent them.
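If you want to check that dimensionality yourself, a small sketch using gensim's downloader API (the download is large, over a gigabyte):

    # Sketch: load the Google News pre-trained vectors and inspect their dimensionality.
    import gensim.downloader as api

    wv = api.load("word2vec-google-news-300")  # returns KeyedVectors
    print(wv.vector_size)                      # 300
    print(wv["king"].shape)                    # (300,)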
To assess which word2vec model is best for your data, one simple approach is to take a fixed set of word pairs you expect to be close (say, 200 pairs), compute the distance between the two vectors in each pair under each model, and sum those distances; the model with the smallest total distance is your best candidate.
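A rough sketch of that kind of comparison, assuming model_a and model_b are already-trained Word2Vec models and pairs is a hypothetical evaluation set of word pairs you expect to be close:

    # Sketch: compare two models by total cosine distance over expected-similar pairs.
    # model_a, model_b and the pairs list are assumed/illustrative, not from the original post.
    pairs = [("car", "truck"), ("cat", "dog"), ("king", "queen")]  # ... extend to ~200 pairs

    def total_distance(model, word_pairs):
        # KeyedVectors.distance() returns cosine distance (1 - cosine similarity);
        # all words must be present in the model's vocabulary.
        return sum(model.wv.distance(w1, w2) for w1, w2 in word_pairs)

    best = min((model_a, model_b), key=lambda m: total_distance(m, pairs))  # smallest total wins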
As a simpler baseline for comparison, a bag-of-words approach with CountVectorizer works like this: make an object of the class CountVectorizer, write the data into a list to be fitted, fit the data into the CountVectorizer object, and count the words in the data using the learned vocabulary. (Note this produces sparse count vectors, not the dense embeddings word2vec learns.)
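A short sketch of those steps with scikit-learn's CountVectorizer (the sample sentences are invented for illustration):

    # Sketch: bag-of-words counts with CountVectorizer -- sparse count vectors,
    # one column per vocabulary word, unlike word2vec's dense embeddings.
    from sklearn.feature_extraction.text import CountVectorizer

    data = ["the cat sat on the mat", "the dog chased the cat"]

    vectorizer = CountVectorizer()           # make the object
    counts = vectorizer.fit_transform(data)  # fit the data and count words

    print(vectorizer.get_feature_names_out())  # learned vocabulary (sklearn 1.0+)
    print(counts.toarray())                    # per-sentence word counts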
size is, as you note, the dimensionality of the vector.
Word2Vec needs large, varied text examples to create its 'dense' embedding vectors per word. (It's the competition between many contrasting examples during training which allows the word-vectors to move to positions that have interesting distances and spatial-relationships with each other.)
If you only have a vocabulary of 30 words, word2vec is unlikely to be an appropriate technology. And if you do try to apply it, you'd want to use a vector size much lower than your vocabulary size. For example, texts containing many examples of each of tens of thousands of words might justify 100-dimensional word vectors.
Using a higher dimensionality than vocabulary size would more-or-less guarantee 'overfitting'. The training could tend toward an idiosyncratic vector for each word – essentially like a 'one-hot' encoding – that would perform better than any other encoding, because there's no cross-word interference forced by representing a larger number of words in a smaller number of dimensions.
That'd mean a model that does about as well as possible on Word2Vec's internal nearby-word prediction task, but performs terribly on other downstream tasks, because no generalizable knowledge of relative relations has been captured. (The cross-word interference is what the algorithm needs, over many training cycles, to incrementally settle into an arrangement where similar words wind up with similar learned weights, and contrasting words with different ones.)
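To make that concrete, a small sketch on an invented tiny corpus, keeping the vector size well below the vocabulary size (gensim 4.x naming, vector_size rather than size):

    # Sketch: tiny vocabulary, deliberately small embedding dimensionality.
    from gensim.models import Word2Vec

    tiny_corpus = [
        ["red", "green", "blue", "yellow"],
        ["red", "blue", "green", "purple"],
        ["yellow", "purple", "red", "green"],
    ]

    model = Word2Vec(tiny_corpus, min_count=1, vector_size=4, window=2, sg=1)

    print(len(model.wv))         # vocabulary size: 5 words
    print(model.wv.vector_size)  # 4 -- smaller than the vocabulary, as suggested above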