
How are the number of iterations and the number of partitions related in Apache Spark Word2Vec?

According to mllib.feature.Word2Vec - spark 1.3.1 documentation [1]:

def setNumIterations(numIterations: Int): Word2Vec.this.type

Sets number of iterations (default: 1), which should be smaller than or equal to number of partitions.

def setNumPartitions(numPartitions: Int): Word2Vec.this.type

Sets number of partitions (default: 1). Use a small number for accuracy.

But according to this pull request [2]:

To make our implementation more scalable, we train each partition separately and merge the model of each partition after each iteration. To make the model more accurate, multiple iterations may be needed.
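
The pull request's description can be illustrated with a toy model. The sketch below is not Spark's code: it fits a single number rather than word vectors, the names `PartitionedTrainingSketch`, `trainPartition`, and `fit` are hypothetical, and merging by averaging is an assumption made for simplicity. It only shows that each partition sees a fraction of the data per iteration, so more partitions call for more iterations to apply the same number of updates.

```scala
object PartitionedTrainingSketch {
  // One local SGD pass over a partition, minimising (w - x)^2 / 2 per sample x.
  def trainPartition(start: Double, data: Seq[Double], lr: Double): Double =
    data.foldLeft(start)((w, x) => w - lr * (w - x))

  // Each iteration: train every partition from the same weight, then merge.
  // Merging by averaging is an assumption of this sketch, not Spark's rule.
  def fit(data: Seq[Double], numPartitions: Int, numIterations: Int, lr: Double): Double = {
    require(data.nonEmpty && numPartitions > 0, "need data and at least one partition")
    val chunkSize = math.ceil(data.size.toDouble / numPartitions).toInt
    val partitions = data.grouped(chunkSize).toVector
    (1 to numIterations).foldLeft(0.0) { (w, _) =>
      val local = partitions.map(p => trainPartition(w, p, lr))
      local.sum / local.size
    }
  }
}
```

With eight identical samples of 5.0 and a step size of 0.5, `fit(data, 1, 1, 0.5)` ends about 0.0195 away from 5.0 after one iteration. `fit(data, 4, 1, 0.5)` ends at 3.75, and it takes `fit(data, 4, 4, 0.5)` to reach the same point as the single-partition run.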

Questions:

  • How do the parameters numIterations and numPartitions affect the internal workings of the algorithm?

  • Is there a trade-off between the number of partitions and the number of iterations, given the following rules?

    • more accuracy -> more iterations, according to [2]

    • more iterations -> more partitions, according to [1]

    • more partitions -> less accuracy

asked Jun 02 '16 04:06 by Arshiyan Alam


1 Answer

Increasing the number of partitions decreases the amount of data each partition is trained on, which makes each training step (word vector adjustment) noisier and less reliable. Spark's implementation compensates by lowering the learning rate as the number of partitions grows, since more processes are updating the vector weights.
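
A learning-rate schedule of the kind described above might look like the following sketch. The linear decay formula, the floor at 0.0001 of the base rate, and the names `LearningRateSketch`, `alpha`, `wordsSeenInPartition`, and `trainWordsCount` are assumptions for illustration; check the Word2Vec source of your Spark version for the actual rule.

```scala
object LearningRateSketch {
  // Assumed schedule: the rate decays linearly with the words a partition has
  // processed, scaled by numPartitions, so more partitions decay it faster.
  def alpha(learningRate: Double,
            numPartitions: Int,
            wordsSeenInPartition: Long,
            trainWordsCount: Long): Double = {
    val progress = numPartitions * wordsSeenInPartition.toDouble / (trainWordsCount + 1)
    val decayed = learningRate * (1 - progress)
    // Keep the rate from reaching zero or going negative (assumed floor).
    math.max(decayed, learningRate * 0.0001)
  }
}
```

Under these assumptions, a partition that has processed the same number of words gets a smaller rate when `numPartitions` is 4 than when it is 1, which matches the behaviour the answer describes.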

answered Nov 06 '22 07:11 by rennerj2