When using, for example, gensim's word2vec or a similar method to train your embedding vectors, I was wondering: is there a good or preferred ratio between the embedding dimension and the vocabulary size? And how does that change as more data comes along?
While I am on the topic, how would one choose a good window size when training the embedding vectors?
I am asking because I am not training my network on a real-life language dictionary; rather, the sentences describe relationships between processes, files, other processes, and so on. For example, a sentence in my text corpus would look like:
smss.exe irp_mj_create systemdrive windows system32 ntdll dll DesiredAccess: Execute/Traverse, Synchronize, Disposition: Open, Options: , Attributes: n/a, ShareMode: Read, AllocationSize: n/a, OpenResult: Opened
As you may imagine, the variations are numerous, but the question remains: how can I best tune these hyperparameters so that the embedding space does not overfit but still has enough meaningful features for each word?
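For reference, here is a minimal sketch of how such a line could be split into tokens and collected into "sentences" for gensim (the regex and lowercasing here are purely illustrative, not necessarily my actual preprocessing):

```python
import re

# One example log line from the corpus.
raw_lines = [
    "smss.exe irp_mj_create systemdrive windows system32 ntdll dll "
    "DesiredAccess: Execute/Traverse, Synchronize, Disposition: Open, "
    "Options: , Attributes: n/a, ShareMode: Read, AllocationSize: n/a, "
    "OpenResult: Opened",
]

def tokenize(line):
    # Lowercase and split on whitespace, commas and colons; drop empty pieces.
    return [tok for tok in re.split(r"[\s,:]+", line.lower()) if tok]

# Each token list becomes one "sentence" for word2vec training.
sentences = [tokenize(line) for line in raw_lines]
print(sentences[0][:8])
```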
Thanks,
Gabriel
If you're in a hurry, one rule of thumb is to use the fourth root of the total number of unique categorical elements; another is that the embedding dimension should be approximately 1.6 times the square root of the number of unique elements in the category, and no more than 600.
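As a rough sketch, both rules of thumb amount to something like this (function names are just for illustration):

```python
# Rough sketch of the two rules of thumb above.
def fourth_root_rule(n_categories: int) -> int:
    # Fourth root of the number of unique categories.
    return round(n_categories ** 0.25)

def capped_sqrt_rule(n_categories: int) -> int:
    # ~1.6 * sqrt(number of unique categories), capped at 600.
    return min(600, round(1.6 * n_categories ** 0.5))

print(fourth_root_rule(1_500_000))   # ~35
print(capped_sqrt_rule(1_500_000))   # 600 (the cap applies)
```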
The vocabulary size is equal to the size of the model's input vector. The input vector is created using one-hot encoding, i.e. it has a “1” in the position assigned to the target word and “0” in all other positions.
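As a toy illustration (vocabulary and words made up):

```python
# The one-hot input vector has one slot per vocabulary word.
vocab = ["smss.exe", "irp_mj_create", "ntdll", "dll", "read"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)          # input vector length == vocabulary size
    vec[word_to_index[word]] = 1    # "1" only at the target word's position
    return vec

print(one_hot("ntdll"))  # [0, 0, 1, 0, 0]
```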
I don't recall any specific papers on this problem, but the question feels a bit odd: in general, if I had a great model but wanted to switch to a vocabulary that is twice or ten times bigger, I would not change the embedding dimensions.
IMHO they're quite orthogonal, unrelated parameters. The key factors for deciding on the optimal embedding dimension are mainly the availability of computing resources (smaller is better, so if there's no difference in results and you can halve the dimensions, do so), the task, and (most importantly) the quantity of supervised training examples. The choice of embedding dimension determines how much you compress, or intentionally bottleneck, the lexical information: larger dimensionality allows your model to distinguish more lexical detail, which is good if and only if your supervised data has enough information to use that detail properly; if it doesn't, the extra lexical information will overfit, and a smaller embedding dimensionality will generalize better. So a ratio between the vocabulary size and the embedding dimension is not (IMHO, I can't give evidence, it's just practical experience) something to look at, since the best embedding dimension is decided by where you use the embeddings, not by the data on which you train them.
In any case, this seems like a situation where your mileage will vary: any theory and discussion will be interesting, but your task and text domain are quite specific, so findings from general NLP may or may not apply to your case, and it would be best to get empirical evidence for what works on your data. Train embeddings with 64/128/256 or 100/200/400 or whatever sizes, train models using each of those, and compare the effects; that will take less effort (of people, not GPUs) than reasoning about what the effects should be.
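For instance, a sweep with gensim could look roughly like the sketch below (the toy corpus and the final print are placeholders for your own data and downstream evaluation; argument names follow gensim 4.x):

```python
from gensim.models import Word2Vec

# Placeholder corpus: replace with your tokenized process/file "sentences".
sentences = [
    ["smss.exe", "irp_mj_create", "ntdll", "dll", "desiredaccess", "execute"],
    ["svchost.exe", "irp_mj_read", "ntdll", "dll", "sharemode", "read"],
] * 100  # repeated so this toy example has something to train on

for vector_size in (64, 128, 256):
    for window in (5, 10, 15):
        model = Word2Vec(
            sentences,
            vector_size=vector_size,  # embedding dimension
            window=window,            # context window size
            min_count=1,
            sg=1,                     # skip-gram
            negative=10,
            epochs=5,
            seed=0,
        )
        # Replace this with your real downstream evaluation
        # (e.g. train your classifier on these vectors and compare scores).
        print(vector_size, window, model.wv.most_similar("ntdll", topn=3))
```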
This Google Developers blog post says:
Well, the following "formula" provides a general rule of thumb about the number of embedding dimensions:
embedding_dimensions = number_of_categories**0.25
That is, the embedding vector dimension should be the 4th root of the number of categories.
Interestingly, the Word2vec Wikipedia article says (emphasis mine):
Nevertheless, for skip-gram models trained in medium size corpora, with 50 dimensions, a window size of 15 and 10 negative samples seems to be a good parameter setting.
Assuming a standard-ish sized vocabulary of 1.5 million words, this rule of thumb comes surprisingly close:
50 == 1.5e6 ** 0.2751
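Checking the arithmetic in plain Python:

```python
import math

vocab_size = 1_500_000
print(vocab_size ** 0.25)                  # fourth root: ~35
print(math.log(50) / math.log(vocab_size)) # exponent that yields 50: ~0.2751
print(vocab_size ** 0.2751)                # ~50, and 0.2751 is close to 0.25
```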