I am using 24 virtual CPU cores and 100 GB of memory to train Doc2Vec with Gensim, but CPU usage stays around 200% no matter how many cores I configure.
(Screenshots of top and htop showed CPU usage hovering around 200%, indicating the CPU wasn't being used efficiently.)
import multiprocessing
from collections import OrderedDict

import gensim
from gensim.models.doc2vec import Doc2Vec

cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

simple_models = [
    # PV-DBOW plain
    Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2, sample=0,
            epochs=20, workers=cores),
    # PV-DM w/ default averaging; a higher starting alpha may improve CBOW/PV-DM modes
    Doc2Vec(dm=1, vector_size=100, window=10, negative=5, hs=0, min_count=2, sample=0,
            epochs=20, workers=cores, alpha=0.05, comment='alpha=0.05'),
    # PV-DM w/ concatenation - big, slow, experimental mode
    # window=5 (both sides) approximates paper's apparent 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, vector_size=100, window=5, negative=5, hs=0, min_count=2, sample=0,
            epochs=20, workers=cores),
]

for model in simple_models:
    model.build_vocab(all_x_w2v)
    print("%s vocabulary scanned & state initialized" % model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)
Edit:
I tried using the corpus_file parameter instead of documents, and that resolved the problem above. However, I had to adjust my code to write all_x_w2v out to a file, since all_x_w2v couldn't be passed directly.
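The conversion step might look like this sketch. Note all_x_w2v here is a hypothetical stand-in list of already-tokenized documents; if your items are TaggedDocument objects, write out doc.words instead:

```python
# Hypothetical sketch: convert an in-memory corpus into the space-delimited,
# text-per-line format that corpus_file expects. all_x_w2v here is a
# stand-in list of token lists.
all_x_w2v = [["first", "document", "tokens"],
             ["second", "document", "tokens"]]

with open("corpus.txt", "w", encoding="utf-8") as f:
    for tokens in all_x_w2v:
        f.write(" ".join(tokens) + "\n")  # one space-delimited document per line
```

After this, "corpus.txt" can be supplied via corpus_file, and each document's tag becomes its line number.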
The Python Global Interpreter Lock ("GIL") and other inter-thread bottlenecks prevent the classic gensim Word2Vec/Doc2Vec/etc. flexible corpus-iterator mode – where you can supply any re-iterable sequence of texts – from saturating all CPU cores.
You can improve the throughput a bit with steps like:

- larger values of negative, vector_size, & window
- avoiding any complicated steps (like tokenization) in your iterator – ideally it should just stream from a simple on-disk format
- experimenting with different workers counts – the optimal count varies with your other parameters & system details, but is often in the 3-12 range (no matter how many more cores you have)
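A lean iterator along the lines of the second tip might look like this sketch, assuming a pre-tokenized, space-delimited file named corpus.txt. (The Document namedtuple is a stand-in for gensim's own TaggedDocument, which you'd normally use; training code reads documents through the same .words/.tags attributes.)

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument, for illustration only.
Document = namedtuple("Document", "words tags")

class LineCorpus:
    """Re-iterable corpus that streams pre-tokenized lines from disk.

    All heavy work (tokenization, cleaning) is assumed done once, up front,
    when the file was written; iterating only splits on whitespace.
    """
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                yield Document(words=line.split(), tags=[i])
```

Because __iter__ reopens the file each time, the same LineCorpus instance can be iterated for the vocabulary scan and then again for each training epoch.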
Additionally, recent versions of gensim offer an alternative corpus-specification method: a corpus_file pointer to an already space-delimited, text-per-line file. If you supply your texts this way, multiple threads will each read the raw file in optimized code – and it's possible to achieve much higher CPU utilization. However, in this mode you lose the ability to specify your own document tags, or more than one tag per document. (The documents will just be given unique IDs based on their line number in the file.)
See the docs for Doc2Vec, and its corpus_file parameter:
https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.Doc2Vec