I am using the Doc2Vec class from gensim in Python to convert a document to a vector. An example of usage:

model = Doc2Vec(documents, size=100, window=8, min_count=5, workers=4)

How should I interpret the size parameter? I know that if I set size = 100, the length of the output vector will be 100, but what does that mean? For instance, if I increase size to 200, what is the difference?
Word2Vec learns a distributed representation of a word, which essentially means that multiple neurons capture a single concept (a concept can be word meaning, sentiment, part of speech, etc.), and a single neuron also contributes to multiple concepts.
These concepts are learnt automatically rather than pre-defined, hence you can think of them as latent/hidden. For the same reason, the word vectors can be used for multiple applications.
The larger the size parameter, the greater the capacity of your neural network to represent these concepts, but more data is required to train these vectors (since they are initialised randomly). In the absence of a sufficient number of sentences or computing power, it is better to keep size small.
Doc2Vec uses a slightly different neural network architecture than Word2Vec, but the meaning of size is analogous.
The difference is in the level of detail the model can capture. Generally, the more dimensions you give Word2Vec, the better the model, up to a certain point.
The size is typically between 100 and 300. Keep in mind that more dimensions also mean more memory is needed.
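To make the memory trade-off concrete: the input embedding matrix alone holds vocabulary_size × size float32 values, so doubling size doubles that footprint. A small back-of-the-envelope sketch (the 100k vocabulary is an assumption for illustration, and gensim also keeps output weights, roughly doubling the total):

```python
def word_vector_bytes(vocab_size: int, size: int, bytes_per_float: int = 4) -> int:
    """Approximate memory for the input embedding matrix alone."""
    return vocab_size * size * bytes_per_float

# Hypothetical 100k-word vocabulary:
for size in (100, 200, 300):
    mb = word_vector_bytes(100_000, size) / 1e6
    print(f"size={size}: ~{mb:.0f} MB")  # 40, 80, 120 MB
```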