I am learning about Word2Vec using the TensorFlow tutorial. The code I am running for Word2Vec is also from the TensorFlow tutorial: https://github.com/tensorflow/models/blob/master/tutorials/embedding/word2vec_optimized.py. When I ran the code for 15 epochs, the test accuracy was around 30%. When I ran it for 100 epochs, test accuracy got up to around 39%. I am using the Text8 dataset for training and questions-words.txt for evaluation.
Do I need to run for more epochs? Should I be using a different dataset? How can I improve test accuracy?
Another option is Google's pretrained GoogleNews vector set. The download is 1.5GB! It includes word vectors for a vocabulary of 3 million words and phrases, trained on roughly 100 billion words from a Google News dataset.
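If you go that route, loading the pretrained file with gensim is straightforward. A minimal sketch, assuming the standard GoogleNews-vectors-negative300.bin file name from that release:

```python
from gensim.models import KeyedVectors

# Load the pretrained GoogleNews vectors (binary word2vec format).
# The file name assumes the standard release; loading the full 3M-word
# vocabulary needs several GB of RAM.
vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

print(vectors.most_similar('king', topn=3))  # quick sanity check
```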
Training a Word2Vec model takes about 22 hours, and a FastText model takes about 33 hours. If that's too long for you, you can use a smaller "iter" value (fewer training epochs), but the performance might be worse.
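For reference, here is a minimal sketch of where that "iter" setting goes when training with gensim (the file path and other parameter values are assumptions for illustration, not the benchmark's settings):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

# Stream sentences from a local text8 file (path is an assumption).
sentences = Text8Corpus('text8')

# 'iter' is the number of training epochs in older gensim releases;
# gensim 4.x renamed the same parameter to 'epochs'.
model = Word2Vec(sentences, size=300, window=5, min_count=5,
                 iter=5, workers=4)
model.save('word2vec_text8.model')
```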
Training the network works like this (sketched in code below):
1. Take a training sample and generate the output value of the network.
2. Evaluate the loss by comparing the model's prediction with the true output label.
3. Update the weights of the network using gradient descent on the evaluated loss.
4. Take another sample and start over again.
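That loop maps directly onto a few lines of TensorFlow 2. This is a generic sketch, not the tutorial's actual code; model, optimizer, and loss_fn are assumed to be defined elsewhere:

```python
import tensorflow as tf

def train_step(model, optimizer, loss_fn, inputs, labels):
    with tf.GradientTape() as tape:
        predictions = model(inputs)          # forward pass on one sample/batch
        loss = loss_fn(labels, predictions)  # compare prediction to true label
    # Gradient descent update on the evaluated loss.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```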
To prepare the dataset for training a word2vec model, flatten it into a list of sentences, each represented as a sequence of token ids. This step is required because you iterate over each sentence in the dataset to produce positive and negative examples.
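As an illustration, here is a minimal sketch of generating those examples with tf.keras.preprocessing.sequence.skipgrams (the toy sentences and vocabulary size are made up):

```python
import tensorflow as tf

# Toy corpus: each inner list is one sentence, already mapped to token ids.
sentences = [[1, 2, 3, 4], [5, 2, 6]]
vocab_size = 7

pairs, labels = [], []
for sentence in sentences:  # iterate sentence by sentence
    p, l = tf.keras.preprocessing.sequence.skipgrams(
        sentence,
        vocabulary_size=vocab_size,
        window_size=2,         # context words within +/-2 of the target
        negative_samples=1.0)  # one random negative per positive pair
    pairs.extend(p)
    labels.extend(l)
# 'pairs' holds (target, context) id pairs; 'labels' is 1 for positive
# pairs and 0 for negative ones.
```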
Larger datasets are better; text8 is very, very small – sufficient for showing some of the analogy-solving power of word-vectors, but not good enough for other purposes.
More iterations may help squeeze slightly stronger vectors out of smaller datasets, but with diminishing returns. (No number of extra iterations over a weak dataset can extract the same rich interrelationships that a larger, more varied corpus can provide.)
There's a related text9 from the same source that, if I recall correctly, is 10x larger. You'll likely get better evaluation results from using it than from doing 10x more iterations on text8.
I believe the 3 million pretrained vectors Google once released – the GoogleNews set – were trained on a corpus of 100 billion words' worth of news articles, but with just 3 passes.
Note that there's no single standard for word-vector quality: the questions-words.txt analogy solving is just one convenient evaluation, and it's possible the word-vectors best at that won't be best for your own domain-specific analyses. Similarly, word-vectors trained on one domain of text, like the GoogleNews set from news articles, might underperform compared to vectors trained on text that better matches your domain (like perhaps forum posts, scientific articles, etc. – which all use different words in different ways).
So it's often best to use your own corpus, and your own goal-specific quantitative evaluation, to help adjust corpus/parameter choices.
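If you do want the questions-words.txt score for your own model, gensim ships an analogy-evaluation helper; a minimal sketch, with file paths assumed (the model path matches the earlier training sketch):

```python
from gensim.models import Word2Vec

# Load a previously trained model (path is an assumption).
model = Word2Vec.load('word2vec_text8.model')

# Score it on the analogy test set; returns overall accuracy plus
# per-section details (available in gensim 3.4+).
score, sections = model.wv.evaluate_word_analogies('questions-words.txt')
print(f'analogy accuracy: {score:.2%}')
```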