I tried to apply doc2vec to 600,000 rows of sentences. Code is below:
from gensim import models

model = models.Doc2Vec(alpha=.025, min_alpha=.025, min_count=1, workers=5)
model.build_vocab(res)

token_count = sum([len(sentence) for sentence in res])
token_count

%%time
for epoch in range(100):
    #print('iteration: ' + str(epoch + 1))
    #model.train(sentences)
    model.train(res, total_examples=token_count, epochs=model.iter)
    model.alpha -= 0.0001  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay
I am getting very poor results with the above implementation. The change I made, apart from what was suggested in the tutorial, was to replace the line:
model.train(sentences)
with:
token_count = sum([len(sentence) for sentence in res])
model.train(res, total_examples=token_count, epochs=model.iter)
From the gensim documentation:

min_count (int, optional) – Ignores all words with total frequency lower than this.
max_vocab_size (int, optional) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.
Unfortunately, your code is a nonsensical mix of misguided practices, so don't follow whatever online example you've been following!
Taking the problems in order, top to bottom:
Don't make min_alpha the same as alpha. The stochastic-gradient-descent optimization process needs a gradual decline from a larger to a smaller alpha learning rate over the course of seeing many varied examples, and should generally end with a negligible, near-zero value. (There are other problems with the code's attempt to explicitly decrement alpha in this way, which we'll get to below.) Only expert users who already have a working setup, understand the algorithm well, and are performing experimental tweaks should be changing the alpha/min_alpha defaults.
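For illustration only, here is a minimal sketch that leaves the learning-rate schedule to gensim; the vector_size, epochs, and workers values are placeholders, not recommendations:

from gensim.models import Doc2Vec

# alpha (default 0.025) and min_alpha (default 0.0001) are left untouched, so
# gensim decays the learning rate linearly from alpha down to min_alpha over
# the requested number of epochs.
model = Doc2Vec(vector_size=100, epochs=20, workers=5)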
Don't set min_count=1. Rare words that appear only once, or a few times, are generally not helpful for Word2Vec/Doc2Vec training. Their few occurrences mean their corresponding model weights don't get much training, and those few occurrences are more likely to be unrepresentative of the words' true meaning (as would be reflected in test data or later production data). So the model's representations of these individual rare words are unlikely to become very good. But in total, all those rare words compete a lot with the other words that do have a chance to become meaningful, so the 'rough' rare words mainly act as random interference against the other words. Or, those rare words add extra model vocabulary parameters that help the model become superficially better on the training data, by memorizing non-generalizable idiosyncrasies there, but worse on future test/production data. So min_count is another default (5) that should only be changed once you have a working baseline, and if you rigorously meta-optimize this parameter later, on a good-sized dataset (like your 600K docs), you're quite likely to find that a higher min_count, rather than a lower one, improves final results.
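As a rough sketch only (the candidate thresholds are arbitrary, and the evaluation step is whatever downstream task you care about, not a gensim call), a later meta-optimization of min_count could look like:

from gensim.models import Doc2Vec

# Baseline: keep the default min_count=5 until the rest of the pipeline works.
# Then compare a few higher thresholds on your own held-out evaluation.
for candidate in (5, 10, 20, 50):
    trial = Doc2Vec(vector_size=100, epochs=20, workers=5, min_count=candidate)
    trial.build_vocab(res)
    trial.train(res, total_examples=trial.corpus_count, epochs=trial.epochs)
    # ...score `trial` on held-out data and keep the best-performing min_count...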
Why make a token_count? There's no later place where a total token count is needed. The total_examples parameter expects a count of text examples – that is, the number of individual documents/sentences – not the total number of words. By supplying the (much larger) word count, train() can't manage alpha correctly or estimate progress in its logged output.
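To make the distinction concrete, assuming res is an in-memory list of documents: the figure train() wants is the document count, and after build_vocab() gensim also stores that figure on the model for you:

example_count = len(res)  # number of documents: what total_examples expects
# ...or, once model.build_vocab(res) has run, use gensim's own count of the documents:
example_count = model.corpus_count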
Don't call train() multiple times in a loop with your own explicit alpha management, unless you're positive you know what you're doing. Most people get it wrong. By supplying the default model.iter (which has a value of 5) as the epochs parameter here, you're actually performing 500 total passes over your corpus, which is unlikely to be what you want. And by decrementing an initial 0.025 alpha by 0.0001 over 100 loops, you wind up with a final alpha of 0.015 – still more than half the starting value, nowhere near the near-zero value the decay should reach. Instead, call train() exactly once, with a correct total_examples and a well-chosen epochs value (10 to 20 epochs are common in published Doc2Vec work). It will then do exactly the right number of passes, manage alpha intelligently, and print accurate progress estimates in its logging.
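Putting those points together, a corrected sketch (assuming gensim 3.x and that res is a restartable corpus of TaggedDocument objects; the parameter values are only illustrative) would be:

from gensim.models import Doc2Vec

model = Doc2Vec(vector_size=100, min_count=5, epochs=20, workers=5)
model.build_vocab(res)

# One train() call: gensim runs exactly `epochs` passes over the corpus and
# decays alpha from its default 0.025 down to the default min_alpha of 0.0001.
model.train(res, total_examples=model.corpus_count, epochs=model.epochs)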
Finally, this next point isn't necessarily a problem in your code, because you don't show how your corpus res is constructed, but there is a common error to beware of: make sure your corpus can be iterated over multiple times (as if it were an in-memory list, or a restartable iterable object over something coming from IO). Often people supply a single-use iterator, which after one pass (as in build_vocab()) returns nothing more – resulting in instant training and a uselessly still-random, untrained model. (If you've enabled logging and pay attention to the logged output and the timing of each step, it'll be obvious if this is a problem.)