 

How much data is actually required to train a doc2Vec model?

I have been using gensim's library to train a Doc2Vec model. After experimenting with different training datasets, I am still unsure what an ideal training data size for a Doc2Vec model should be.

I will share my understanding here. Please feel free to correct me or suggest changes:

  1. Training on a general-purpose dataset: if I want to use a model trained on a general-purpose dataset for a specific use case, I need to train it on a lot of data.
  2. Training on a context-related dataset: if I train on data that has the same context as my use case, a smaller training dataset is usually enough.

But roughly how many words are needed for training in each of these cases?

Generally, we stop training an ML model when the error curve reaches an "elbow point", beyond which further training does not significantly decrease the error. Has any study been done in this direction, where a Doc2Vec model's training is stopped after reaching such an elbow?

Shalabh Singh asked Jan 02 '18

People also ask

How long does it take to train Doc2Vec?

Although a 20-document corpus seems small, the upside is that it takes only around 2 minutes to train the model. Before training, some simple text cleaning is needed: building a tokenizer and an English stopword set from NLTK. These are basic NLTK operations done before training any model on a text corpus.
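As a rough illustration, here is a minimal preprocessing sketch along those lines. It assumes the NLTK 'punkt' tokenizer and 'stopwords' data have already been downloaded via nltk.download(), and raw_docs is just a placeholder for your own corpus:

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Placeholder corpus - replace with your own documents.
    raw_docs = [
        "Doc2Vec learns fixed-length vectors for variable-length documents.",
        "Gensim provides an implementation of the paragraph-vector model.",
    ]

    stop_words = set(stopwords.words('english'))

    def clean(text):
        # Lowercase, tokenize, and drop stopwords and non-alphabetic tokens.
        tokens = word_tokenize(text.lower())
        return [t for t in tokens if t.isalpha() and t not in stop_words]

    tokenized_docs = [clean(doc) for doc in raw_docs]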

Is Doc2Vec better than Word2Vec?

To sum up, Doc2Vec works much better than the Word2Vec model. It is worth noting, though, that for document classification we need to somehow transform the word vectors produced by Word2Vec into document vectors.

How does Doc2Vec model work?

Doc2Vec also uses an unsupervised learning approach to learn document representations. The number of words per input document can vary, while the output is a fixed-length vector. Paragraph vectors and word vectors are initialized and then trained together.
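For concreteness, here is a minimal gensim training sketch (gensim 4.x API assumed; the tiny corpus and the parameter values are purely illustrative):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Each document is a TaggedDocument: a list of word tokens plus a unique tag.
    corpus = [
        TaggedDocument(words=['doc2vec', 'learns', 'document', 'vectors'], tags=['doc_0']),
        TaggedDocument(words=['word2vec', 'learns', 'word', 'vectors'], tags=['doc_1']),
    ]

    # Paragraph (document) vectors and word vectors are initialized and trained together.
    model = Doc2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=40)

    # Inference: variable-length token input, fixed-length (here 50-dimensional) output.
    new_vector = model.infer_vector(['unseen', 'document', 'words'])
    print(new_vector.shape)  # (50,)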


1 Answer

There are no absolute guidelines - it depends a lot on your dataset and specific application goals. There's some discussion of the sizes of datasets used in published Doc2Vec work at:

what is the minimum dataset size needed for good performance with doc2vec?

If your general-purpose corpus doesn't match your domain's vocabulary – including the same words, or using words in the same senses – that's a problem that can't be fixed with just "a lot of data". More data could just 'pull' word contexts and representations more towards generic, rather than domain-specific, values.

You really need to have your own quantitative, automated evaluation/scoring method, so you can measure whether results with your specific data and goals are sufficient, or improving with more data or other training tweaks.
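One common sanity check - used, for example, in gensim's own Doc2Vec tutorial - is to re-infer a vector for each training document and see whether its own trained vector comes back as the nearest neighbour. The sketch below assumes a trained gensim 4.x model and a corpus list of TaggedDocument objects, as in the earlier snippet:

    import collections

    # For each training document, re-infer a vector and find the rank of its own tag.
    ranks = []
    for doc in corpus:
        inferred = model.infer_vector(doc.words)
        sims = model.dv.most_similar([inferred], topn=len(model.dv))
        rank = [tag for tag, _ in sims].index(doc.tags[0])
        ranks.append(rank)

    # Ideally most documents are their own nearest neighbour (rank 0).
    print(collections.Counter(ranks))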

Sometimes parameter tweaks can help get the most out of thin data – in particular, more training iterations or a smaller model (fewer vector dimensions) can partially offset some issues with small corpora. But Word2Vec/Doc2Vec really benefit from lots of subtly-varied, domain-specific data – it's the constant, incremental tug-of-war between all the text examples during training that helps the final representations settle into a useful constellation of arrangements, with the desired relative-distance/relative-direction properties.
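As a hedged illustration of that kind of tweak, the settings below (fewer dimensions, more epochs) are just one plausible starting point for a thin corpus, not recommended values; corpus again stands for your list of TaggedDocument objects:

    from gensim.models.doc2vec import Doc2Vec

    # Illustrative settings for a small corpus: a smaller model, many more passes.
    small_data_model = Doc2Vec(
        corpus,           # your tokenized TaggedDocument list
        vector_size=25,   # fewer dimensions than the common 100-300 range
        epochs=100,       # more iterations, since each pass sees little data
        min_count=2,      # still discard the very rarest words
        window=5,
    )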

gojomo answered Oct 14 '22