I have been struggling to understand the use of the size parameter in gensim.models.Word2Vec. From the Gensim documentation, size is the dimensionality of the vector. Now, as far as I understand, word2vec creates, for each word, a vector of the probability of closeness with the other words in the sentence. So if my vocab size is 30, how can it create a vector with a dimension greater than 30? Can anyone please brief me on the optimal value for the Word2Vec size?
Thank you.
We can train the gensim word2vec model with our own custom corpus as follows:

    >>> model = Word2Vec(sent, min_count=1, size=50, workers=3, window=3, sg=1)

Let's try to understand the hyperparameters of this model. size: the number of dimensions of the embeddings; the default is 100.
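For instance, a minimal runnable sketch (the toy corpus here is invented for illustration; note that in gensim 4.x the parameter was renamed from size to vector_size):

    # Minimal sketch: train Word2Vec on a tiny illustrative corpus (gensim 4.x API).
    from gensim.models import Word2Vec

    sent = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "chased", "the", "cat"],
    ]

    # vector_size=50 (size=50 in older gensim) maps each word to a 50-dimensional dense vector
    model = Word2Vec(sent, min_count=1, vector_size=50, workers=3, window=3, sg=1)

    print(model.wv["cat"].shape)  # (50,) -- one 50-dimensional vector per word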
The standard pre-trained Word2Vec vectors (the Google News set) have 300 dimensions. We have tended to use 200 or fewer, on the rationale that our corpus and vocabulary are much smaller than Google News's, so we need fewer dimensions to represent them.
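If you want to check that dimensionality yourself, a small sketch using gensim's downloader API (the download is large, over a gigabyte):

    # Sketch: load the Google News pre-trained vectors and inspect their dimensionality.
    import gensim.downloader as api

    wv = api.load("word2vec-google-news-300")  # returns KeyedVectors
    print(wv.vector_size)                      # 300
    print(wv["king"].shape)                    # (300,)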
To assess which word2vec model is best for your data, one simple approach is to take a fixed set of word pairs you expect to be close (say, 200 pairs), compute the distance between the two vectors in each pair under each model, and sum those distances; the model with the smallest total distance is your best candidate.
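A rough sketch of that kind of comparison, assuming model_a and model_b are already-trained Word2Vec models and pairs is a hypothetical evaluation set of word pairs you expect to be close:

    # Sketch: compare two models by total cosine distance over expected-similar pairs.
    # model_a, model_b and the pairs list are assumed/illustrative, not from the original post.
    pairs = [("car", "truck"), ("cat", "dog"), ("king", "queen")]  # ... extend to ~200 pairs

    def total_distance(model, word_pairs):
        # KeyedVectors.distance() returns cosine distance (1 - cosine similarity);
        # all words must be present in the model's vocabulary.
        return sum(model.wv.distance(w1, w2) for w1, w2 in word_pairs)

    best = min((model_a, model_b), key=lambda m: total_distance(m, pairs))  # smallest total wins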
As a simpler baseline for comparison, a bag-of-words approach with CountVectorizer works like this: make an object of the class CountVectorizer, write the data into a list to be fitted, fit the data into the CountVectorizer object, and count the words in the data using the learned vocabulary. (Note this produces sparse count vectors, not the dense embeddings word2vec learns.)
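A short sketch of those steps with scikit-learn's CountVectorizer (the sample sentences are invented for illustration):

    # Sketch: bag-of-words counts with CountVectorizer -- sparse count vectors,
    # one column per vocabulary word, unlike word2vec's dense embeddings.
    from sklearn.feature_extraction.text import CountVectorizer

    data = ["the cat sat on the mat", "the dog chased the cat"]

    vectorizer = CountVectorizer()           # make the object
    counts = vectorizer.fit_transform(data)  # fit the data and count words

    print(vectorizer.get_feature_names_out())  # learned vocabulary (sklearn 1.0+)
    print(counts.toarray())                    # per-sentence word counts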
size is, as you note, the dimensionality of the vector.
Word2Vec needs large, varied text examples to create its 'dense' embedding vectors per word. (It's the competition between many contrasting examples during training which allows the word-vectors to move to positions that have interesting distances and spatial-relationships with each other.)
If you only have a vocabulary of 30 words, word2vec is unlikely to be an appropriate technology. And if you do try to apply it, you'd want to use a vector size much lower than your vocabulary size. For example, texts containing many examples of each of tens of thousands of words might justify 100-dimensional word vectors.
Using a higher dimensionality than vocabulary size would more-or-less guarantee 'overfitting'. The training could tend toward an idiosyncratic vector for each word – essentially like a 'one-hot' encoding – that would perform better than any other encoding, because there's no cross-word interference forced by representing a larger number of words in a smaller number of dimensions.
That'd mean a model that does about as well as possible on Word2Vec's internal nearby-word prediction task, but performs terribly on other downstream tasks, because no generalizable knowledge of relative relations has been captured. (The cross-word interference is what the algorithm needs, over many training cycles, to incrementally settle into an arrangement where similar words wind up with similar learned weights, and contrasting words with different ones.)
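To make that concrete, a small sketch on an invented tiny corpus, keeping the vector size well below the vocabulary size (gensim 4.x naming, vector_size rather than size):

    # Sketch: tiny vocabulary, deliberately small embedding dimensionality.
    from gensim.models import Word2Vec

    tiny_corpus = [
        ["red", "green", "blue", "yellow"],
        ["red", "blue", "green", "purple"],
        ["yellow", "purple", "red", "green"],
    ]

    model = Word2Vec(tiny_corpus, min_count=1, vector_size=4, window=2, sg=1)

    print(len(model.wv))         # vocabulary size: 5 words
    print(model.wv.vector_size)  # 4 -- smaller than the vocabulary, as suggested above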