Gensim Word2Vec model getting worse by increasing the number of epochs

I'm building a Word2Vec model for category recommendation on a dataset consisting of ~35,000 sentences, for a total of ~500,000 words but only ~3,000 distinct ones. I build the model basically like this:

import multiprocessing
from gensim.models import Word2Vec

cores = multiprocessing.cpu_count()

def train_w2v_model(df, epochs):
    w2v_model = Word2Vec(min_count=5,
                         window=100,
                         size=230,
                         sample=0,
                         workers=cores - 1,
                         batch_words=100)
    vocab = df['sentences'].apply(list)  # the corpus: one token list per row
    w2v_model.build_vocab(vocab)
    w2v_model.train(vocab,
                    total_examples=w2v_model.corpus_count,
                    total_words=w2v_model.corpus_total_words,
                    epochs=epochs,
                    compute_loss=True)
    return w2v_model.get_latest_training_loss()

I tried to find the right number of epochs for such a model like this:

print(train_w2v_model(df, epochs=1))
=>> 86898.2109375
print(train_w2v_model(df, epochs=100))
=>> 5025273.0

I find these results very counterintuitive. I do not understand how increasing the number of epochs could lower performance. It does not seem to be a misunderstanding of get_latest_training_loss, since the most_similar results are much better with only 1 epoch:

100 epochs:

w2v_model.wv.most_similar(['machine_learning'])
=>> [('salesforce', 0.3464601933956146),
 ('marketing_relationnel', 0.3125850558280945),
 ('batiment', 0.30903393030166626),
 ('go', 0.29414454102516174),
 ('simulation', 0.2930642068386078),
 ('data_management', 0.28968319296836853),
 ('scraping', 0.28260597586631775),
 ('virtualisation', 0.27560457587242126),
 ('dataviz', 0.26913416385650635),
 ('pandas', 0.2685554623603821)]

1 epoch:

w2v_model.wv.most_similar(['machine_learning'])
=>> [('data_science', 0.9953729510307312),
 ('data_mining', 0.9930223822593689),
 ('big_data', 0.9894922375679016),
 ('spark', 0.9881765842437744),
 ('nlp', 0.9879133701324463),
 ('hadoop', 0.9834049344062805),
 ('deep_learning', 0.9831978678703308),
 ('r', 0.9827396273612976),
 ('data_visualisation', 0.9805369973182678),
 ('nltk', 0.9800992012023926)]

Any insight into why it behaves like this? I would have thought that increasing the number of epochs would surely have a positive effect on the training loss.

asked Oct 01 '19 by Thibaut Loiseleur


1 Answer

First, the reporting of running training loss is, at least through gensim-3.8.1 (September 2019), a bit of a half-baked feature. It's just a running sum of all loss across all epochs, and thus always increasing, rather than a per-epoch loss that could decrease. There's a long-pending fix that still needs a little work, but until that's added, reported numbers need to be compared against earlier values to detect any epoch-to-epoch decrease.
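Until that fix lands, one workaround (a sketch, assuming gensim 3.x and that an epoch-end callback is acceptable) is to record the running total at the end of each epoch and subtract the previous value to get a per-epoch figure:

from gensim.models.callbacks import CallbackAny2Vec

class EpochLossPrinter(CallbackAny2Vec):
    """Print the loss accumulated during each epoch (running total minus previous)."""
    def __init__(self):
        self.last_total = 0.0

    def on_epoch_end(self, model):
        total = model.get_latest_training_loss()  # running sum across epochs
        print('loss this epoch:', total - self.last_total)
        self.last_total = total

# e.g. w2v_model.train(..., compute_loss=True, callbacks=[EpochLossPrinter()])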

But also note: even a per-epoch loss isn't a direct measure of model quality/performance for outside purposes. It's only an indicator of whether training is still helping on its internal optimization task.

Second, if in fact the model is getting worse at an external evaluation – like whether the most_similar() results match human estimations – with more training, then that's often an indication that overfitting is occurring. That is, the (possibly oversized) model is memorizing features of the (likely undersized) training data, and thus getting better at its internal optimization goals in ways that no longer generalize to the larger world of interest.
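A quick way to watch for that (a rough sketch, assuming corpus is the same list of token lists used above and that 'machine_learning' survives the min_count cut) is to retrain from scratch at a few epoch counts and compare the neighbours each run:

from gensim.models import Word2Vec

# Retrain from scratch for several epoch counts and eyeball the neighbours.
# (gensim 3.x API: `size` and `iter`; each run starts from a fresh random init.)
for n_epochs in (1, 5, 20, 100):
    model = Word2Vec(corpus, min_count=5, size=230, window=100,
                     sample=0, workers=3, iter=n_epochs)
    print(n_epochs, model.wv.most_similar(['machine_learning'])[:3])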

500K total words is fairly small for a word2vec training set, but can be serviceable if you only intend to train a smallish vocabulary (so there are still many varied examples of each word) and use smallish-dimensioned vectors.

It's not clear what your calculated min_count is, but note that increasing it, by shrinking the model, can fight overfitting. But also note that any words appearing fewer times than that threshold will be completely ignored during training, making your effective training data size smaller.
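For instance (a sketch, assuming corpus is the list of token lists fed to build_vocab() above), you can check how many distinct words, and how much of the raw text, each candidate threshold would keep before committing to one:

from collections import Counter

# Count token frequencies once across the whole corpus.
freqs = Counter(token for text in corpus for token in text)
total_tokens = sum(freqs.values())

for threshold in (1, 2, 5, 10, 20):
    kept = [w for w, c in freqs.items() if c >= threshold]
    kept_tokens = sum(freqs[w] for w in kept)
    print(threshold, len(kept), 'distinct words kept,',
          kept_tokens, 'of', total_tokens, 'total tokens kept')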

Similarly, it's not clear what embedding_size you're using, but trying to make "very large" vectors for a small vocabulary is very overfitting-prone, as there's plenty of "room" for vectors to memorize training details. Smaller vectors force a sort of "compression" on the model that results in learning that's more likely to generalize. (My very rough rule of thumb, not at all rigorously established, is to never use a dense embedding size larger than the square root of the expected vocabulary size. So, with 10K tokens, no more than 100-dimension vectors.)
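Applied to the figures in the question (a back-of-the-envelope check, not a hard rule), ~3,000 distinct words would suggest staying at or below roughly 55 dimensions, well under the 230 used above:

import math

vocab_size = 3000                  # ~3,000 distinct words reported in the question
print(int(math.sqrt(vocab_size)))  # -> 54, so roughly 50-60 dimensions at most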

Other observations, probably unrelated to your problem, but perhaps of interest to your goals (a combined parameter sketch follows this list):

  • window=100 is atypical, and seems far larger than your average text size (~14 words) – if the intent is that all tokens should affect all others, without regard to distance (perhaps because the source data is inherently unordered), that's appropriate, and you might as well go far larger (say 1 million). On the other hand, if a token's direct neighbors are more relevant than others in the same text, a smaller window would make sense.

  • there's no good reason to use batch_words=100 – it will only slow training, and if in fact you have any texts that are larger than 100 words, it will artificially break them up (thus undoing any benefit from the giant window value above). (Leave this as the default.)

  • the train() method will only use total_examples, or total_words, but not both – so you only need to specify one

  • as it appears you might be working with something more like category-recommendation than pure natural language, you may also want to tinker with non-default values of the ns_exponent parameter – check the doc-comment & referenced paper in the class docs about that parameter for more details.
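Putting those observations together, a revised setup might look something like the sketch below (gensim 3.x API; every value here is an illustrative placeholder to be tuned, not a recommendation):

import multiprocessing
from gensim.models import Word2Vec

cores = multiprocessing.cpu_count()

w2v_model = Word2Vec(min_count=5,        # or higher, to shrink the vocabulary further
                     window=1000000,     # effectively "whole text" if token order is irrelevant
                     size=100,           # smaller vectors for a ~3,000-word vocabulary
                     sample=0,
                     workers=cores - 1,  # batch_words left at its default
                     ns_exponent=0.75)   # the default; other values may suit recommendation-like data

w2v_model.build_vocab(corpus)
w2v_model.train(corpus,
                total_examples=w2v_model.corpus_count,  # total_words not needed as well
                epochs=5,
                compute_loss=True)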

answered Sep 19 '22 by gojomo