 

How much data is actually required to train a doc2Vec model?

I have been using gensim's library to train a Doc2Vec model. After experimenting with different training datasets, I am still unsure what an ideal training data size for a Doc2Vec model should be.

I will share my understanding here. Please feel free to correct me or suggest changes:

  1. Training on a general-purpose dataset: if I want to use a model trained on a general-purpose dataset for a specific use case, I need to train it on a lot of data.
  2. Training on a context-related dataset: if I train on data that has the same context as my use case, a smaller training dataset is usually enough.

But roughly how many words are needed for training in each of these cases?

Generally, we stop training an ML model when the error curve reaches an "elbow point", beyond which further training does not significantly decrease the error. Has any study been done in this direction, where a Doc2Vec model's training is stopped after reaching such an elbow?

Shalabh Singh asked Jan 02 '18

People also ask

How long does it take to train Doc2Vec?

Although a 20-document corpus seems small, the upside is that it takes only around 2 minutes to train the model. Before training, some simple text cleaning is needed: building a tokenizer and an English stopword set from NLTK. These are basic NLTK operations done before training any model on a text corpus.
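As a rough illustration, here is a minimal preprocessing sketch along those lines. It assumes the NLTK 'punkt' tokenizer and 'stopwords' data have already been downloaded via nltk.download(), and raw_docs is just a placeholder for your own corpus:

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Placeholder corpus - replace with your own documents.
    raw_docs = [
        "Doc2Vec learns fixed-length vectors for variable-length documents.",
        "Gensim provides an implementation of the paragraph-vector model.",
    ]

    stop_words = set(stopwords.words('english'))

    def clean(text):
        # Lowercase, tokenize, and drop stopwords and non-alphabetic tokens.
        tokens = word_tokenize(text.lower())
        return [t for t in tokens if t.isalpha() and t not in stop_words]

    tokenized_docs = [clean(doc) for doc in raw_docs]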

Is Doc2Vec better than Word2Vec?

To sum up, Doc2Vec works much better than the Word2Vec model. It is worth noting, though, that for document classification we need to somehow transform the word vectors produced by Word2Vec into document vectors.

How does Doc2Vec model work?

Doc2Vec also uses an unsupervised learning approach to learn document representations. The number of words per input document can vary, while the output is a fixed-length vector. Paragraph vectors and word vectors are initialized and then trained together.
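For concreteness, here is a minimal gensim training sketch (gensim 4.x API assumed; the tiny corpus and the parameter values are purely illustrative):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Each document is a TaggedDocument: a list of word tokens plus a unique tag.
    corpus = [
        TaggedDocument(words=['doc2vec', 'learns', 'document', 'vectors'], tags=['doc_0']),
        TaggedDocument(words=['word2vec', 'learns', 'word', 'vectors'], tags=['doc_1']),
    ]

    # Paragraph (document) vectors and word vectors are initialized and trained together.
    model = Doc2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=40)

    # Inference: variable-length token input, fixed-length (here 50-dimensional) output.
    new_vector = model.infer_vector(['unseen', 'document', 'words'])
    print(new_vector.shape)  # (50,)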


1 Answer

There are no absolute guidelines - it depends a lot on your dataset and specific application goals. There's some discussion of the sizes of datasets used in published Doc2Vec work at:

what is the minimum dataset size needed for good performance with doc2vec?

If your general-purpose corpus doesn't match your domain's vocabulary – including the same words, or using words in the same senses – that's a problem that can't be fixed with just "a lot of data". More data could just 'pull' word contexts and representations more towards generic, rather than domain-specific, values.

You really need to have your own quantitative, automated evaluation/scoring method, so you can measure whether results with your specific data and goals are sufficient, or improving with more data or other training tweaks.
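One common sanity check - used, for example, in gensim's own Doc2Vec tutorial - is to re-infer a vector for each training document and see whether its own trained vector comes back as the nearest neighbour. The sketch below assumes a trained gensim 4.x model and a corpus list of TaggedDocument objects, as in the earlier snippet:

    import collections

    # For each training document, re-infer a vector and find the rank of its own tag.
    ranks = []
    for doc in corpus:
        inferred = model.infer_vector(doc.words)
        sims = model.dv.most_similar([inferred], topn=len(model.dv))
        rank = [tag for tag, _ in sims].index(doc.tags[0])
        ranks.append(rank)

    # Ideally most documents are their own nearest neighbour (rank 0).
    print(collections.Counter(ranks))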

Sometimes parameter tweaks can help get the most out of thin data – in particular, more training iterations or a smaller model (fewer vector dimensions) can partially offset some issues with small corpora. But Word2Vec/Doc2Vec really benefit from lots of subtly-varied, domain-specific data – it's the constant, incremental tug-of-war between all the text examples during training that helps the final representations settle into a useful constellation of arrangements, with the desired relative-distance/relative-direction properties.
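As a hedged illustration of that kind of tweak, the settings below (fewer dimensions, more epochs) are just one plausible starting point for a thin corpus, not recommended values; corpus again stands for your list of TaggedDocument objects:

    from gensim.models.doc2vec import Doc2Vec

    # Illustrative settings for a small corpus: a smaller model, many more passes.
    small_data_model = Doc2Vec(
        corpus,           # your tokenized TaggedDocument list
        vector_size=25,   # fewer dimensions than the common 100-300 range
        epochs=100,       # more iterations, since each pass sees little data
        min_count=2,      # still discard the very rarest words
        window=5,
    )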

gojomo answered Oct 14 '22