Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does the fold action work in Spark?

Below I have a Scala example of a Spark fold action:

val rdd1 = sc.parallelize(List(1,2,3,4,5), 3)
rdd1.fold(5)(_ + _)

This produces the output 35. Can somebody explain in detail how this output gets computed?

like image 991
thedevd Avatar asked Jan 20 '18 16:01

thedevd


People also ask

What is fold in Spark?

Fold in spark: Fold is a very powerful operation in spark which allows you to calculate many important values in O(n) time. If you are familiar with Scala collection it will be like using fold operation on a collection. Even if you not used fold in Scala, this post will make you comfortable in using fold.

What is the difference between reduce and fold in Spark?

fold calls fold on an iterator for each partition, then merges the results, reduce calls reduceLeft on the iterator for each partition then merges the result. The difference is that fold doesn't need to worry about empty partitions or collections, because then it will just use the zero value.

How does Spark apply an action operation?

Actions. Transformations create RDDs from each other, but when we want to work with the actual dataset, at that point action is performed. When the action is triggered after the result, new RDD is not formed like transformation. Thus, Actions are Spark RDD operations that give non-RDD values.

Why fold left and fold right are not supported in Spark?

In order for Spark to become a leader in computational speed, it needed to incorporate operational parallelism. Parallelism will ultimately be the reason foldLeft is not found on the RDD class.


1 Answers

Taken from the Scaladocs here (emphasis mine):

@param zeroValue the initial value for the accumulated result of each partition for the op operator, and also the initial value for the combine results from different partitions for the op operator - this will typically be the neutral element (e.g. Nil for list concatenation or 0 for summation)

The zeroValue is in your case added four times (one for each partition, plus one when combining the results from the partitions). So the result is:

(5 + 1) + (5 + 2 + 3) + (5 + 4 + 5) + 5 // (extra one for combining results)
like image 96
SCouto Avatar answered Nov 16 '22 03:11

SCouto