Below I have a Scala example of a Spark fold action:
val rdd1 = sc.parallelize(List(1,2,3,4,5), 3)
rdd1.fold(5)(_ + _)
This produces the output 35. Can somebody explain in detail how this output gets computed?
Fold in Spark: fold is a very powerful operation in Spark that lets you calculate many important values in O(n) time. If you are familiar with Scala collections, it will feel like using fold on a collection. Even if you have not used fold in Scala before, this post will make you comfortable using it.
fold calls fold on the iterator of each partition and then merges the results; reduce calls reduceLeft on the iterator of each partition and then merges the results. The difference is that fold does not need to worry about empty partitions or collections, because it simply uses the zero value for them.
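To make the empty-collection point concrete, here is a minimal sketch (assuming an existing SparkContext named sc, as in the question): reduce throws on an empty RDD, while fold falls back to the zero value.

val empty = sc.parallelize(Seq.empty[Int], 2)
// empty.reduce(_ + _)  // throws UnsupportedOperationException: empty collection
empty.fold(0)(_ + _)    // returns 0: each empty partition just contributes the zero value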
Actions. Transformations create RDDs from other RDDs, but when we want to work with the actual dataset, an action is performed. When an action is triggered, no new RDD is formed, unlike with a transformation. Thus, actions are Spark RDD operations that return non-RDD values.
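As a small illustration (again assuming sc), map is a transformation that only builds a new RDD lazily, while fold is an action that actually runs the job and returns a plain value:

val nums = sc.parallelize(List(1, 2, 3, 4, 5))
val doubled = nums.map(_ * 2)       // transformation: returns a new RDD, nothing is computed yet
val total = doubled.fold(0)(_ + _)  // action: triggers the computation and returns an Int (30)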
In order for Spark to become a leader in computational speed, it needed to incorporate parallelism. Parallelism is ultimately the reason foldLeft is not found on the RDD class.
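A sketch of that difference (assuming sc): foldLeft on a Scala collection processes elements strictly left to right and applies the zero value exactly once, which cannot be split across partitions; RDD.fold requires an associative operator so partial results can be merged in any order, and it applies the zero value once per partition and once more when merging.

List(1, 2, 3, 4, 5).foldLeft(5)(_ + _)                 // 20: sequential, zero value used exactly once
sc.parallelize(List(1, 2, 3, 4, 5), 3).fold(5)(_ + _)  // 35: zero value used per partition and once when merging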
Taken from the Scaladocs here (emphasis mine):
@param zeroValue the initial value for the accumulated result of each partition for the op operator, and also the initial value for the combine results from different partitions for the op operator - this will typically be the neutral element (e.g. Nil for list concatenation or 0 for summation)
The zeroValue is in your case added four times (once for each partition, plus once when combining the results from the partitions). So the result is:
(5 + 1) + (5 + 2 + 3) + (5 + 4 + 5) + 5 // (extra one for combining results)
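You can check this yourself with glom(), which exposes the per-partition contents (a sketch assuming the same sc; the exact split of the five elements over the three partitions depends on the partitioning, but for parallelize it is typically as shown):

val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5), 3)
rdd1.glom().collect().map(_.toList)  // e.g. Array(List(1), List(2, 3), List(4, 5))
rdd1.fold(5)(_ + _)                  // (5 + 1) + (5 + 2 + 3) + (5 + 4 + 5) + 5 = 35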