 

How many epochs should Word2Vec be trained for? What is a recommended training dataset?

I am learning about Word2Vec using the TensorFlow tutorial. The code I am running is also from the TensorFlow tutorial: https://github.com/tensorflow/models/blob/master/tutorials/embedding/word2vec_optimized.py. When I ran the code for 15 epochs, the test accuracy was around 30%. When I ran it for 100 epochs, test accuracy got up to around 39%. I am using the text8 dataset for training and questions-words.txt for evaluation.

Do I need to run for more epochs? Should I be using a different dataset? How can I improve test accuracy?

A. Buhman asked Oct 20 '17


People also ask

What dataset is Word2Vec trained on?

The pretrained GoogleNews vectors file is about 1.5 GB. It includes word vectors for a vocabulary of 3 million words and phrases, trained by Google on roughly 100 billion words from a Google News dataset.
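For reference, here is a minimal sketch of loading those pretrained GoogleNews vectors with gensim. The filename and gensim usage are illustrative (they are not part of the original question or answer) and assume the file has already been downloaded:

```python
from gensim.models import KeyedVectors

# Assumes GoogleNews-vectors-negative300.bin has been downloaded locally.
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

print(kv["king"].shape)                  # 300-dimensional vector
print(kv.most_similar("king", topn=3))   # nearest neighbours by cosine similarity
```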

How long does it take to train Word2Vec?

Training a Word2Vec model takes about 22 hours, and a FastText model about 33 hours. If that is too long for you, you can use a smaller "iter" value (fewer training iterations), but performance may be worse.
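If you train with gensim rather than the TensorFlow tutorial script, the number of passes is controlled by a single parameter. A minimal sketch, with illustrative parameter values (note that gensim versions before 4.0 call `epochs` `iter` and `vector_size` `size`):

```python
from gensim.models import Word2Vec

# `sentences` is an iterable of token lists, e.g. a tokenized text8 corpus.
model = Word2Vec(
    sentences,
    vector_size=200,   # called `size` in gensim < 4.0
    window=5,
    min_count=5,
    sg=1,              # skip-gram
    epochs=15,         # called `iter` in gensim < 4.0; more passes = longer training
    workers=4,
)
model.save("text8_sg.model")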

How do you train a Word2Vec?

Training the network: (1) we take a training sample and generate the output value of the network; (2) we evaluate the loss by comparing the model prediction with the true output label; (3) we update the weights of the network using gradient descent on the evaluated loss; (4) we then take another sample and start over again.
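As an illustration of that loop (not the actual tutorial code), here is a minimal numpy sketch of one skip-gram-with-negative-sampling SGD step; vocabulary size, dimensionality and learning rate are placeholder values:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10000, 100
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))    # center-word vectors
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))   # context-word vectors

def sgd_step(center, context, negatives, lr=0.025):
    """One training sample: push sigmoid(v_center . u_context) toward 1 for the
    true context word and toward 0 for each sampled negative word."""
    v = W_in[center]
    grad_v = np.zeros(dim)
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = W_out[word]
        pred = 1.0 / (1.0 + np.exp(-np.dot(v, u)))   # model prediction
        g = lr * (pred - label)                      # log-loss gradient, scaled by lr
        grad_v += g * u                              # accumulate gradient for the center vector
        W_out[word] -= g * v                         # descend on the context/negative vector
    W_in[center] -= grad_v                           # descend on the center vector

# One sample: center word 3, true context word 17, five random negative words.
sgd_step(3, 17, rng.integers(0, vocab_size, size=5).tolist())
```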

How do I prepare data for Word2Vec?

To prepare the dataset for training a word2vec model, flatten it into a list of tokenized sentences (sequences of words). This step is required because you will iterate over each sentence in the dataset to produce positive and negative examples.
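A minimal sketch of that preparation for text8, which is one long line with no sentence boundaries, so we simply cut it into fixed-length chunks (the chunk length is arbitrary):

```python
import re

def to_sentences(raw_text, chunk_len=10000):
    """Tokenize raw text and split it into fixed-length 'sentences' (token lists)."""
    tokens = re.findall(r"[a-z]+", raw_text.lower())
    return [tokens[i:i + chunk_len] for i in range(0, len(tokens), chunk_len)]

with open("text8") as f:          # text8 is a single long line of lowercase text
    sentences = to_sentences(f.read())

print(len(sentences), "chunks;", sum(len(s) for s in sentences), "tokens")
```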


1 Answer

Larger datasets are better; text8 is very, very small – sufficient for showing some of the analogy-solving power of word-vectors, but not good enough for other purposes.

More iterations may help squeeze slightly stronger vectors out of smaller datasets, but with diminishing returns. (No number of extra iterations over a weak dataset can extract the same rich interrelationships that a larger, more varied corpus can provide.)

There's a related text9 from the same source that, if I recall correctly, is 10x larger. You'll likely get better evaluation results from using it than from doing 10x more iterations on text8.
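If you do try the larger file, gensim's Text8Corpus reader streams that single-line format directly, so the only change is the filename. A sketch, assuming text9 has been downloaded from the same source (the filenames and parameters here are illustrative):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

# Text8Corpus yields fixed-size chunks of words from a single-line corpus file.
corpus = Text8Corpus("text9")     # swap in "text8" for the smaller file
model = Word2Vec(corpus, vector_size=200, sg=1, epochs=5, workers=4)
model.save("text9_sg.model")
```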

I believe the 3 million pretrained vectors Google once released – the GoogleNews set – were trained on a corpus of 100 billion words' worth of news articles, but with just 3 passes.

Note that there's no single standard for word-vector quality: the questions-words.txt analogy solving is just one convenient evaluation, but it's possible the word-vectors best at that won't be best for your own domain-specific analyses. Similarly, word-vectors trained on one domain of text, like the GoogleNews set from news articles, might underperform vectors trained on text that better matches your domain (like perhaps forum posts, scientific articles, etc. – which all use different words in different ways).

So it's often best to use your own corpus, and your own goal-specific quantitative evaluation, to help adjust corpus/parameter choices.
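For the analogy test specifically, gensim can score a trained model against questions-words.txt directly; a sketch assuming a saved model like the ones in the snippets above (file names are illustrative):

```python
from gensim.models import Word2Vec

model = Word2Vec.load("text8_sg.model")             # any previously saved model
score, sections = model.wv.evaluate_word_analogies("questions-words.txt")
print(f"overall analogy accuracy: {score:.1%}")     # compare with your own goal-specific checks
```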

gojomo answered Oct 19 '22