How are the number of iterations and the number of partitions related in Apache Spark Word2Vec?

Question

According to mllib.feature.Word2Vec - spark 1.3.1 documentation [1]:

def setNumIterations(numIterations: Int): Word2Vec.this.type

Sets number of iterations (default: 1), which should be smaller than or equal to number of partitions.

def setNumPartitions(numPartitions: Int): Word2Vec.this.type

Sets number of partitions (default: 1). Use a small number for accuracy.

But in this Pull Request [2]:

To make our implementation more scalable, we train each partition separately and merge the model of each partition after each iteration. To make the model more accurate, multiple iterations may be needed.

Questions:

How do the parameters numIterations and numPartitions affect the internal working of the algorithm?
Is there a trade-off between the number of partitions and the number of iterations, considering the following rules?
- more accuracy -> more iterations, according to [2]
- more iterations -> more partitions, according to [1]
- more partitions -> less accuracy

rennerj2 · Accepted Answer

When you increase the number of partitions, each partition is trained on less data, so each training step (word vector adjustment) becomes noisier and less reliable. Spark's implementation responds by decreasing the learning rate as the number of partitions grows, since more processes are updating the vector weights.
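To make the learning-rate point concrete, here is a minimal sketch in plain Python (no Spark required). It assumes a linear decay schedule of the kind used in word2vec-style training, where each partition scales its own word count by the number of partitions to estimate overall progress. The function name `learning_rate`, the floor factor, and the example numbers are illustrative assumptions, not something taken from the Spark docs quoted in the question, so check the Word2Vec source for your Spark version before relying on the exact formula.

```python
def learning_rate(start_alpha, words_seen, total_words, num_partitions):
    """Linearly decayed learning rate as seen from inside one partition.

    Assumption (hypothetical, for illustration): a partition only counts the
    words it has processed itself, so it multiplies that count by
    num_partitions to approximate how far training has progressed overall.
    """
    progress = num_partitions * words_seen / (total_words + 1)
    alpha = start_alpha * (1.0 - progress)
    # Floor the rate so it never reaches zero or goes negative.
    return max(alpha, start_alpha * 1e-4)


total_words = 1_000_000  # made-up corpus size
for num_partitions in (1, 4, 16):
    # Rate after this partition has processed 50,000 of its own words.
    alpha = learning_rate(0.025, 50_000, total_words, num_partitions)
    print(num_partitions, alpha)
```

Under these assumptions, for the same number of words processed locally, a partition in a 16-partition job is already using a much smaller learning rate than one in a single-partition job. That fits the doc's advice to use a small number of partitions for accuracy, and the suggestion that extra iterations help recover accuracy when you do use more partitions.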

Tags:

apache-spark

word2vec

apache-spark-mllib

Arshiyan Alam
