According to the mllib.feature.Word2Vec documentation for Spark 1.3.1 [1]:
def setNumIterations(numIterations: Int): Word2Vec.this.type
Sets number of iterations (default: 1), which should be smaller than or equal to number of partitions.
def setNumPartitions(numPartitions: Int): Word2Vec.this.type
Sets number of partitions (default: 1). Use a small number for accuracy.
But in this Pull Request [2]:
To make our implementation more scalable, we train each partition separately and merge the model of each partition after each iteration. To make the model more accurate, multiple iterations may be needed.
Questions:
How do the parameters numIterations and numPartitions affect the internal working of the algorithm?
Is there a trade-off between the number of partitions and the number of iterations, given the following rules?
more accuracy -> more iterations, according to [2]
more iterations -> at least as many partitions, according to [1]
more partitions -> less accuracy
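For concreteness, setting both parameters together while keeping numIterations at or below numPartitions (as the docs advise) looks like this; this is a sketch only, and a real job would additionally need a SparkContext and an input RDD of tokenized sentences to call fit on:

```scala
import org.apache.spark.mllib.feature.Word2Vec

object Word2VecConfigSketch {
  // Sketch: configure the trade-off explicitly. The values 4/4/100 are
  // illustrative choices, not recommendations from the Spark docs.
  val word2vec = new Word2Vec()
    .setNumPartitions(4)  // more partitions -> more parallelism, noisier updates
    .setNumIterations(4)  // keep <= numPartitions, per [1]
    .setVectorSize(100)
  // val model = word2vec.fit(sentences)  // sentences: RDD[Seq[String]]
}
```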
When you increase the number of partitions, you decrease the amount of data each partition is trained on, which makes each training step (each word-vector adjustment) noisier and less reliable. Spark's implementation compensates by decreasing the learning rate as the number of partitions grows, since more concurrent processes are updating the vector weights.
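The train-separately-then-merge scheme from [2] can be illustrated with a toy sketch in plain Scala (this is not Spark's actual code; the "partitions" here are just per-partition target vectors, and the merge is a simple average):

```scala
// Toy sketch of partition-wise training with merging, per the scheme in [2].
// Assumptions (not from the source): each partition nudges a shared vector
// toward a local target with one SGD-style step, then results are averaged.
object PartitionMergeSketch {
  type Vec = Array[Double]

  // One local training step: move the vector toward the partition's target.
  def localStep(v: Vec, target: Vec, lr: Double): Vec =
    v.zip(target).map { case (x, t) => x + lr * (t - x) }

  // Merge the per-partition results by averaging, coordinate by coordinate.
  def merge(models: Seq[Vec]): Vec = {
    val dim = models.head.length
    (0 until dim).map(i => models.map(_(i)).sum / models.size).toArray
  }

  def train(init: Vec, targets: Seq[Vec], baseLr: Double, iterations: Int): Vec = {
    val numPartitions = targets.size
    // Scale the learning rate down with the partition count, mimicking how
    // more concurrent updaters would otherwise over-adjust the weights.
    val lr = baseLr / numPartitions
    (1 to iterations).foldLeft(init) { (v, _) =>
      merge(targets.map(t => localStep(v, t, lr)))
    }
  }

  def main(args: Array[String]): Unit = {
    val targets = Seq(Array(1.0, 0.0), Array(0.0, 1.0))
    val v1 = train(Array(0.0, 0.0), targets, baseLr = 0.5, iterations = 1)
    val v4 = train(Array(0.0, 0.0), targets, baseLr = 0.5, iterations = 4)
    println(v1.mkString(","))  // prints 0.125,0.125
    println(v4.mkString(","))  // closer to the targets' mean (0.5, 0.5)
  }
}
```

With more partitions the per-step learning rate shrinks, so a single iteration moves the merged vector less; running more iterations recovers the lost progress, which is exactly the trade-off the question asks about.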