I'm new to Spark and want to understand how MapReduce works under the hood so I can use it properly. This post provided a great answer, but my results don't seem to follow the logic described. I'm running the Spark Quick Start guide in Scala on the command line. When I sum the line lengths, everything comes out fine: the total is 1213.
scala> val textFile = sc.textFile("README.md")
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
scala> val linesWithSparkLengths = linesWithSpark.map(s => s.length)
scala> linesWithSparkLengths.foreach(println)
Result:
14
78
73
42
68
17
62
45
76
64
54
74
84
29
136
77
77
73
70
scala> val totalLWSparkLength = linesWithSparkLengths.reduce((a,b) => a+b)
totalLWSparkLength: Int = 1213
When I tweak it slightly to use (a-b) instead of (a+b),
scala> val totalLWSparkTest = linesWithSparkLengths.reduce((a,b) => a-b)
I expected -1185, according to the logic in this post:
List(14,78,73,42,68,17,62,45,76,64,54,74,84,29,136,77,77,73,70).reduce( (x,y) => x - y )
Step 1 : op( 14, 78 ) will be the first evaluation.
x is 14 and y is 78. Result of x - y = -64.
Step 2: op( op( 14, 78 ), 73 )
x is op(14,78) = -64 and y = 73. Result of x - y = -137
Step 3: op( op( op( 14, 78 ), 73 ), 42)
x is op( op( 14, 78 ), 73 ) = -137 and y is 42. Result is -179.
...
Step 18: op( op( ..., 73 ), 70 ) will be the final evaluation.
x is -1115 and y is 70. Result of x - y is -1185.
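For reference, that strict left-to-right evaluation is exactly what a plain Scala `List` does (no Spark involved), so it can be checked directly in the REPL using the lengths printed above:

```scala
val lengths = List(14, 78, 73, 42, 68, 17, 62, 45, 76, 64,
                   54, 74, 84, 29, 136, 77, 77, 73, 70)

// List.reduce folds strictly left to right on a single thread,
// reproducing the step-by-step walkthrough above.
println(lengths.reduce((x, y) => x - y))  // -1185
```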
However, something strange happens:
scala> val totalLWSparkTest = linesWithSparkLengths.reduce((a,b) => a-b)
totalLWSparkTest: Int = 151
When I run it again...
scala> val totalLWSparkTest = linesWithSparkLengths.reduce((a,b) => a-b)
totalLWSparkTest: Int = -151
Can anyone tell me why the result is 151 (or -151) instead of -1185?
It happens because subtraction is neither associative nor commutative. Let's start with associativity:
(- (- (- 14 78) 73) 42)
(- (- -64 73) 42)
(- -137 42)
-179
is not the same as
(- (- 14 78) (- 73 42))
(- -64 (- 73 42))
(- -64 31)
-95
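The same two groupings can be checked directly in Scala, using the first four lengths from the post:

```scala
// Left-nested grouping, as a sequential fold would evaluate it:
val leftGrouped = ((14 - 78) - 73) - 42   // -179

// A different (tree-shaped) grouping of the same operands:
val treeGrouped = (14 - 78) - (73 - 42)   // -95

println(leftGrouped)
println(treeGrouped)
```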
Now it's time for commutativity:
(- (- (- 14 78) 73) 42) ;; From the previous example
is not the same as
(- (- (- 42 73) 78) 14)
(- (- -31 78) 14)
(- -109 14)
-123
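Again, the same comparison in Scala:

```scala
// Same operands, original order:
val forward  = ((14 - 78) - 73) - 42   // -179

// Same operands, reversed order:
val reversed = ((42 - 73) - 78) - 14   // -123

println(forward)
println(reversed)
```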
Spark first applies reduce on each partition individually and then merges the partial results in arbitrary order. If the function you use doesn't meet one or both criteria, the final result can be non-deterministic. (Note that your two results, 151 and -151, differ only in sign; with subtraction, swapping the merge order of two partials flips the sign, which suggests your RDD had two partitions and the merge order varied between runs.)