I am learning about Word2Vec using the TensorFlow tutorial. The code I am running for Word2Vec is also from the TensorFlow tutorial: https://github.com/tensorflow/models/blob/master/tutorials/embedding/word2vec_optimized.py. When I ran the code for 15 epochs, the test accuracy was around 30%. When I ran it for 100 epochs, test accuracy got up to around 39%. I am using the Text8 dataset for training and questions-words.txt for evaluation.
Do I need to run for more epochs? Should I be using a different dataset? How can I improve test accuracy?
Another option is Google's pretrained GoogleNews vector set. The download is 1.5GB! It includes word vectors for a vocabulary of 3 million words and phrases, trained on roughly 100 billion words from a Google News dataset.
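If you go that route, loading the pretrained file with gensim is straightforward. A minimal sketch, assuming the standard GoogleNews-vectors-negative300.bin file name from that release:

```python
from gensim.models import KeyedVectors

# Load the pretrained GoogleNews vectors (binary word2vec format).
# The file name assumes the standard release; loading the full 3M-word
# vocabulary needs several GB of RAM.
vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

print(vectors.most_similar('king', topn=3))  # quick sanity check
```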
Training a Word2Vec model takes about 22 hours, and a FastText model takes about 33 hours. If that's too long for you, you can use a smaller "iter" value (fewer training epochs), but the performance might be worse.
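For reference, here is a minimal sketch of where that "iter" setting goes when training with gensim (the file path and other parameter values are assumptions for illustration, not the benchmark's settings):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

# Stream sentences from a local text8 file (path is an assumption).
sentences = Text8Corpus('text8')

# 'iter' is the number of training epochs in older gensim releases;
# gensim 4.x renamed the same parameter to 'epochs'.
model = Word2Vec(sentences, size=300, window=5, min_count=5,
                 iter=5, workers=4)
model.save('word2vec_text8.model')
```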
Training the network works like this (sketched in code below):
1. Take a training sample and generate the output value of the network.
2. Evaluate the loss by comparing the model's prediction with the true output label.
3. Update the weights of the network using gradient descent on the evaluated loss.
4. Take another sample and start over again.
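That loop maps directly onto a few lines of TensorFlow 2. This is a generic sketch, not the tutorial's actual code; model, optimizer, and loss_fn are assumed to be defined elsewhere:

```python
import tensorflow as tf

def train_step(model, optimizer, loss_fn, inputs, labels):
    with tf.GradientTape() as tape:
        predictions = model(inputs)          # forward pass on one sample/batch
        loss = loss_fn(labels, predictions)  # compare prediction to true label
    # Gradient descent update on the evaluated loss.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```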
To prepare the dataset for training a word2vec model, flatten it into a list of sentences, each represented as a sequence of token ids. This step is required because you iterate over each sentence in the dataset to produce positive and negative examples.
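As an illustration, here is a minimal sketch of generating those examples with tf.keras.preprocessing.sequence.skipgrams (the toy sentences and vocabulary size are made up):

```python
import tensorflow as tf

# Toy corpus: each inner list is one sentence, already mapped to token ids.
sentences = [[1, 2, 3, 4], [5, 2, 6]]
vocab_size = 7

pairs, labels = [], []
for sentence in sentences:  # iterate sentence by sentence
    p, l = tf.keras.preprocessing.sequence.skipgrams(
        sentence,
        vocabulary_size=vocab_size,
        window_size=2,         # context words within +/-2 of the target
        negative_samples=1.0)  # one random negative per positive pair
    pairs.extend(p)
    labels.extend(l)
# 'pairs' holds (target, context) id pairs; 'labels' is 1 for positive
# pairs and 0 for negative ones.
```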
Larger datasets are better; text8 is very, very small – sufficient for showing some of the analogy-solving power of word-vectors, but not good enough for other purposes.
More iterations may help squeeze slightly stronger vectors out of smaller datasets, but with diminishing returns. (No number of extra iterations over a weak dataset can extract the same rich interrelationships that a larger, more varied corpus can provide.)
There's a related text9 from the same source that, if I recall correctly, is 10x larger. You'll likely get better evaluation results from using it than from doing 10x more iterations on text8.
I believe the 3 million pretrained vectors Google once released – the GoogleNews set – were trained on a corpus of 100 billion words' worth of news articles, but with just 3 passes.
Note that there's no single standard for word-vector quality: the questions-words.txt analogy solving is just one convenient evaluation, and it's possible the word-vectors best at that won't be best for your own domain-specific analyses. Similarly, word-vectors trained on one domain of text, like the GoogleNews set from news articles, might underperform compared to vectors trained on text that better matches your domain (like perhaps forum posts, scientific articles, etc. – which all use different words in different ways).
So it's often best to use your own corpus, and your own goal-specific quantitative evaluation, to help adjust corpus/parameter choices.
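If you do want the questions-words.txt score for your own model, gensim ships an analogy-evaluation helper; a minimal sketch, with file paths assumed (the model path matches the earlier training sketch):

```python
from gensim.models import Word2Vec

# Load a previously trained model (path is an assumption).
model = Word2Vec.load('word2vec_text8.model')

# Score it on the analogy test set; returns overall accuracy plus
# per-section details (available in gensim 3.4+).
score, sections = model.wv.evaluate_word_analogies('questions-words.txt')
print(f'analogy accuracy: {score:.2%}')
```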