When using, for example, gensim's word2vec or a similar method to train your embedding vectors, I was wondering: is there a good or preferred ratio between the embedding dimension and the vocabulary size? And how does that change as more data comes along?
While I am on the topic, how would one choose a good window size when training the embedding vectors?
I am asking because I am not training my network on a real-life language dictionary; rather, the sentences describe relationships between processes, files, other processes, and so on. For example, a sentence in my text corpus would look like:
smss.exe irp_mj_create systemdrive windows system32 ntdll dll DesiredAccess: Execute/Traverse, Synchronize, Disposition: Open, Options: , Attributes: n/a, ShareMode: Read, AllocationSize: n/a, OpenResult: Opened
As you may imagine, the variations are numerous, but the question remains: how can I best tune these hyperparameters so that the embedding space does not overfit but still has enough meaningful features for each word?
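For reference, here is a minimal sketch of how such a line could be split into tokens and collected into "sentences" for gensim (the regex and lowercasing here are purely illustrative, not necessarily my actual preprocessing):

```python
import re

# One example log line from the corpus.
raw_lines = [
    "smss.exe irp_mj_create systemdrive windows system32 ntdll dll "
    "DesiredAccess: Execute/Traverse, Synchronize, Disposition: Open, "
    "Options: , Attributes: n/a, ShareMode: Read, AllocationSize: n/a, "
    "OpenResult: Opened",
]

def tokenize(line):
    # Lowercase and split on whitespace, commas and colons; drop empty pieces.
    return [tok for tok in re.split(r"[\s,:]+", line.lower()) if tok]

# Each token list becomes one "sentence" for word2vec training.
sentences = [tokenize(line) for line in raw_lines]
print(sentences[0][:8])
```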
Thanks,
Gabriel
If you're in a hurry, one rule of thumb is to use the fourth root of the total number of unique categorical elements; another is that the embedding dimension should be approximately 1.6 times the square root of the number of unique elements in the category, and no more than 600.
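As a rough sketch, both rules of thumb amount to something like this (function names are just for illustration):

```python
# Rough sketch of the two rules of thumb above.
def fourth_root_rule(n_categories: int) -> int:
    # Fourth root of the number of unique categories.
    return round(n_categories ** 0.25)

def capped_sqrt_rule(n_categories: int) -> int:
    # ~1.6 * sqrt(number of unique categories), capped at 600.
    return min(600, round(1.6 * n_categories ** 0.5))

print(fourth_root_rule(1_500_000))   # ~35
print(capped_sqrt_rule(1_500_000))   # 600 (the cap applies)
```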
The vocabulary size is equal to the size of the model's input vector. The input vector is created using one-hot encoding, i.e. it has a “1” in the position assigned to the target word and “0” in all other positions.
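As a toy illustration (vocabulary and words made up):

```python
# The one-hot input vector has one slot per vocabulary word.
vocab = ["smss.exe", "irp_mj_create", "ntdll", "dll", "read"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)          # input vector length == vocabulary size
    vec[word_to_index[word]] = 1    # "1" only at the target word's position
    return vec

print(one_hot("ntdll"))  # [0, 0, 1, 0, 0]
```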
I don't recall any specific papers on this problem, but the question feels a bit odd: in general, if I had a great model but wanted to switch to a vocabulary that is twice or ten times bigger, I would not change the embedding dimensions.
IMHO they're quite orthogonal, unrelated parameters. The key factors for deciding on the optimal embedding dimension are mainly the availability of computing resources (smaller is better, so if there's no difference in results and you can halve the dimensions, do so), the task, and (most importantly) the quantity of supervised training examples. The choice of embedding dimension determines how much you compress, or intentionally bottleneck, the lexical information: larger dimensionality allows your model to distinguish more lexical detail, which is good if and only if your supervised data has enough information to use that detail properly; if it doesn't, the extra lexical information will overfit, and a smaller embedding dimensionality will generalize better. So a ratio between the vocabulary size and the embedding dimension is not (IMHO, I can't give evidence, it's just practical experience) something to look at, since the best embedding dimension is decided by where you use the embeddings, not by the data on which you train them.
In any case, this seems like a situation where your mileage will vary: any theory and discussion will be interesting, but your task and text domain are quite specific, so findings from general NLP may or may not apply to your case, and it would be best to get empirical evidence for what works on your data. Train embeddings with 64/128/256 or 100/200/400 or whatever sizes, train models using each of those, and compare the effects; that will take less effort (of people, not GPUs) than reasoning about what the effects should be.
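For instance, a sweep with gensim could look roughly like the sketch below (the toy corpus and the final print are placeholders for your own data and downstream evaluation; argument names follow gensim 4.x):

```python
from gensim.models import Word2Vec

# Placeholder corpus: replace with your tokenized process/file "sentences".
sentences = [
    ["smss.exe", "irp_mj_create", "ntdll", "dll", "desiredaccess", "execute"],
    ["svchost.exe", "irp_mj_read", "ntdll", "dll", "sharemode", "read"],
] * 100  # repeated so this toy example has something to train on

for vector_size in (64, 128, 256):
    for window in (5, 10, 15):
        model = Word2Vec(
            sentences,
            vector_size=vector_size,  # embedding dimension
            window=window,            # context window size
            min_count=1,
            sg=1,                     # skip-gram
            negative=10,
            epochs=5,
            seed=0,
        )
        # Replace this with your real downstream evaluation
        # (e.g. train your classifier on these vectors and compare scores).
        print(vector_size, window, model.wv.most_similar("ntdll", topn=3))
```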
This Google Developers blog post says:
Well, the following "formula" provides a general rule of thumb about the number of embedding dimensions:
embedding_dimensions = number_of_categories**0.25
That is, the embedding vector dimension should be the 4th root of the number of categories.
Interestingly, the Word2vec Wikipedia article says (emphasis mine):
Nevertheless, for skip-gram models trained in medium size corpora, with 50 dimensions, a window size of 15 and 10 negative samples seems to be a good parameter setting.
Assuming a standard-ish sized vocabulary of 1.5 million words, this rule of thumb comes surprisingly close:
50 == 1.5e6 ** 0.2751
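Checking the arithmetic in plain Python:

```python
import math

vocab_size = 1_500_000
print(vocab_size ** 0.25)                  # fourth root: ~35
print(math.log(50) / math.log(vocab_size)) # exponent that yields 50: ~0.2751
print(vocab_size ** 0.2751)                # ~50, and 0.2751 is close to 0.25
```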