I have been using gensim to train a Doc2Vec model. After experimenting with different training datasets, I am fairly confused about what an ideal training data size for a Doc2Vec model would be.
I will share my understanding here. Please feel free to correct me or suggest changes.
But what is the number of words used for training in both of these cases?
On a general note, we stop training an ML model when the error graph reaches an "elbow point", where further training won't significantly decrease the error. Has any study been done in this direction, where a Doc2Vec model's training is stopped after reaching an elbow?
Although the 20-document corpus seems small, the perk is that it takes only around 2 minutes to train the model. It's time for some simple text cleaning before training the model: building a tokenizer object and a set of English stopwords from NLTK. These are basic NLTK operations done before training any model on a text corpus.
To sum up, Doc2Vec works much better than the Word2Vec model. But it is worth saying that for document classification we need to somehow transform the word vectors produced by Word2Vec into document vectors.
Doc2Vec also uses an unsupervised learning approach to learn the document representation. The amount of input text (i.e. the number of words) per document can vary, while the output is a fixed-length vector. The paragraph vector and word vectors are initialized.
There are no absolute guidelines - it depends a lot on your dataset and specific application goals. There's some discussion of the sizes of datasets used in published Doc2Vec work at:
what is the minimum dataset size needed for good performance with doc2vec?
If your general-purpose corpus doesn't match your domain's vocabulary – including the same words, or using words in the same senses – that's a problem that can't be fixed with just "a lot of data". More data could just 'pull' word contexts and representations more towards generic, rather than domain-specific, values.
You really need to have your own quantitative, automated evaluation/scoring method, so you can measure whether results with your specific data and goals are sufficient, or improving with more data or other training tweaks.
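As one possible shape for such a scoring method, here is a minimal sketch of a "self-match" check: re-infer a vector for each training document and count how often its nearest trained vector is its own. The function names and the toy vectors below are made up for illustration; in practice the `trained` vectors would come from your model and the `inferred` ones from something like gensim's `infer_vector`. This is just one sanity metric, not a substitute for an evaluation tied to your real task.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def self_match_rate(trained, inferred):
    """Fraction of documents whose re-inferred vector is closest to its own trained vector.

    Both arguments are dicts mapping a document id to its vector (hypothetical layout).
    """
    hits = 0
    for doc_id, vec in inferred.items():
        # Find the trained document most similar to this re-inferred vector.
        best = max(trained, key=lambda other: cosine(vec, trained[other]))
        if best == doc_id:
            hits += 1
    return hits / len(inferred)

# Toy, hand-written vectors purely for demonstration (not real model output).
trained = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
inferred = {"a": [0.9, 0.1], "b": [0.1, 0.8], "c": [1.0, 0.2]}

score = self_match_rate(trained, inferred)
print(f"self-match rate: {score:.2f}")
```

Tracking a number like this while you vary corpus size or training parameters gives you a way to see whether more data is still helping, or whether the score has flattened out.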
Sometimes parameter tweaks can help get the most out of thin data - in particular, more training iterations or a smaller model (fewer vector dimensions) can slightly offset some issues with small corpora. But the Word2Vec/Doc2Vec algorithms really benefit from lots of subtly-varied, domain-specific data - it's the constant, incremental tug-of-war between all the text-examples during training that helps the final representations settle into a useful constellation-of-arrangements, with the desired relative-distance/relative-direction properties.