
Word2Vec time complexity

I have googled this issue but I cannot find a reliable answer (some sources give log(V), others log(V/2)). What is the time complexity of the word2vec model with the following parameters:

Word2Vec(corpus, size=4000, window=30, min_count=1, workers=50, iter=100, alpha=0.0001)

I have a vocabulary of 10000 unique words.

artona asked Sep 11 '25 16:09

1 Answer

Without a formal analysis/proof, in practice and in the default 'negative sampling' case, execution time is chiefly determined by the size of the corpus, and grows roughly linearly with the size of the corpus. The number of unique words (vocabulary size V) isn't a major factor.

Gensim's implementation uses a binary search over a vocabulary-sized cumulative-frequency array to sample negative examples, so its time complexity is technically:

O(N * log(V))
  • where N is the total corpus size and
  • V is the unique-words vocabulary count.

But this particular O(log(V)) operation is, in practice, often faster than the memory-hungry O(1) sampling lookup used by the original Google/Mikolov word2vec.c – probably due to improved CPU cache efficiency.
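As a rough sketch of the technique described above (this is an illustration, not gensim's actual code; the word list and frequencies are made up, and gensim applies a `freq ** 0.75` smoothing to the raw counts by default):

```python
import bisect
import random

# Hypothetical word frequencies for a tiny vocabulary.
freqs = {"the": 500, "on": 200, "cat": 40, "sat": 30, "mat": 25}
words = list(freqs)

# Build a cumulative table of smoothed frequencies, one entry per word.
cumulative = []
total = 0.0
for w in words:
    total += freqs[w] ** 0.75  # gensim's default ns_exponent smoothing
    cumulative.append(total)

def sample_negative():
    """Draw one negative example: an O(log V) binary search over the table."""
    r = random.uniform(0.0, total)
    return words[bisect.bisect_left(cumulative, r)]
```

Each draw costs O(log V) comparisons, but the table itself is small and contiguous, which is where the cache-friendliness mentioned above comes from.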

So with the defaults:

  • If one corpus is twice as long, in words, as another, then it will take about twice as long to train the model on the larger corpus.
  • But if one corpus is the same size, in words, as another, but with a vocabulary twice as big, you probably won't notice much change in runtime.
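That linear scaling can be made concrete with a back-of-the-envelope operation count. This is a hedged estimate, not gensim's internals: the per-word pair count and the effective-window assumption (gensim randomly shrinks each window, averaging about half the maximum per side) are simplifications:

```python
def estimated_ops(corpus_words, window=30, negative=5, epochs=100):
    """Rough count of training updates with negative sampling.

    Each center word trains against roughly `window` context words in total
    (about window/2 effective per side, two sides), and each (center, context)
    pair does 1 positive update plus `negative` sampled negative updates.
    """
    pairs_per_word = window
    return corpus_words * epochs * pairs_per_word * (1 + negative)
```

Note that vocabulary size V appears nowhere in the formula; only corpus length, window, negative count, and epochs do, so doubling the corpus doubles the estimate.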

In the non-default hierarchical-softmax case (hs=1, negative=0), words are encoded differently, and their encodings grow longer as the vocabulary grows. This increases the average number of training operations per corpus word – by a factor of log(V), I believe – so we again technically have O(N * log(V)) time complexity.

But this vocabulary-driven increase tends to be more significant, in practice, than the one inside negative sampling's binary-search-based sampling.

So if you have two corpora of the same length, but one has twice the number of unique words, you may very well notice a longer runtime in hierarchical-softmax mode.
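The log(V) growth of hierarchical-softmax work can be demonstrated by building the Huffman codes that word2vec's hs mode uses and measuring their average length. This is a self-contained sketch (a generic Huffman construction, not gensim's code); for a uniform-frequency vocabulary the average code length is exactly log2(V):

```python
import heapq

def avg_code_length(freqs):
    """Frequency-weighted average Huffman code length for given word counts.

    In word2vec's hierarchical-softmax mode, this is roughly the number of
    node updates performed per training example.
    """
    # Heap entries: (subtree frequency, tiebreaker, weighted leaf depth).
    heap = [(f, i, 0.0) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    tiebreak = len(freqs)
    while len(heap) > 1:
        f1, _, w1 = heapq.heappop(heap)
        f2, _, w2 = heapq.heappop(heap)
        # Merging pushes every leaf below the new node one level deeper,
        # adding (f1 + f2) to the total weighted depth.
        heapq.heappush(heap, (f1 + f2, tiebreak, w1 + w2 + f1 + f2))
        tiebreak += 1
    total_freq, _, weighted_depth = heap[0]
    return weighted_depth / total_freq
```

Doubling a uniform vocabulary from 8 to 16 words raises the average code length from 3 to 4 bits, i.e. each doubling of V adds roughly one extra update per training example.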

gojomo answered Sep 13 '25 06:09