I'm new to Spark and want to understand how MapReduce works under the hood so I can use it properly. This post provided a great answer, but my results don't seem to follow the logic described. I'm running the Spark Quick Start guide in Scala on the command line. When I sum the line lengths, everything comes out fine: the total is 1213.
scala> val textFile = sc.textFile("README.md")
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
scala> val linesWithSparkLengths = linesWithSpark.map(s => s.length)
scala> linesWithSparkLengths.foreach(println)
Result:
14
78
73
42
68
17
62
45
76
64
54
74
84
29
136
77
77
73
70
scala> val totalLWSparkLength = linesWithSparkLengths.reduce((a,b) => a+b)
totalLWSparkLength: Int = 1213
When I tweak it slightly to use (a-b) instead of (a+b),
scala> val totalLWSparkTest = linesWithSparkLengths.reduce((a,b) => a-b)
I expected -1185, according to the logic in this post:
List(14,78,73,42,68,17,62,45,76,64,54,74,84,29,136,77,77,73,70).reduce( (x,y) => x - y )
Step 1 : op( 14, 78 ) will be the first evaluation.
x is 14 and y is 78. Result of x - y = -64.
Step 2: op( op( 14, 78 ), 73 )
x is op(14,78) = -64 and y = 73. Result of x - y = -137
Step 3: op( op( op( 14, 78 ), 73 ), 42)
x is op( op( 14, 78 ), 73 ) = -137 and y is 42. Result is -179.
...
Step 18: op( op( ..., 73 ), 70 ) will be the final evaluation.
x is -1115 and y is 70. Result of x - y is -1185.
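For reference, that strict left-to-right evaluation is exactly what a plain Scala `List` does (no Spark involved), so it can be checked directly in the REPL using the lengths printed above:

```scala
val lengths = List(14, 78, 73, 42, 68, 17, 62, 45, 76, 64,
                   54, 74, 84, 29, 136, 77, 77, 73, 70)

// List.reduce folds strictly left to right on a single thread,
// reproducing the step-by-step walkthrough above.
println(lengths.reduce((x, y) => x - y))  // -1185
```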
However, something strange happens:
scala> val totalLWSparkTest = linesWithSparkLengths.reduce((a,b) => a-b)
totalLWSparkTest: Int = 151
When I run it again...
scala> val totalLWSparkTest = linesWithSparkLengths.reduce((a,b) => a-b)
totalLWSparkTest: Int = -151
Can anyone tell me why the result is 151 (or -151) instead of -1185?
It happens because subtraction is neither associative nor commutative. Let's start with associativity:
(- (- (- 14 78) 73) 42)
(- (- -64 73) 42)
(- -137 42)
-179
is not the same as
(- (- 14 78) (- 73 42))
(- -64 (- 73 42))
(- -64 31)
-95
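The same two groupings can be checked directly in Scala, using the first four lengths from the post:

```scala
// Left-nested grouping, as a sequential fold would evaluate it:
val leftGrouped = ((14 - 78) - 73) - 42   // -179

// A different (tree-shaped) grouping of the same operands:
val treeGrouped = (14 - 78) - (73 - 42)   // -95

println(leftGrouped)
println(treeGrouped)
```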
Now it's time for commutativity:
(- (- (- 14 78) 73) 42) ;; From the previous example
is not the same as
(- (- (- 42 73) 78) 14)
(- (- -31 78) 14)
(- -109 14)
-123
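Again, the same comparison in Scala:

```scala
// Same operands, original order:
val forward  = ((14 - 78) - 73) - 42   // -179

// Same operands, reversed order:
val reversed = ((42 - 73) - 78) - 14   // -123

println(forward)
println(reversed)
```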
Spark first applies reduce on each partition individually and then merges the partial results in arbitrary order. If the function you use doesn't meet one or both criteria, the final result can be non-deterministic. (Note that your two results, 151 and -151, differ only in sign; with subtraction, swapping the merge order of two partials flips the sign, which suggests your RDD had two partitions and the merge order varied between runs.)