Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Unexpected results in Spark MapReduce

I'm new to Spark and want to understand how MapReduce gets done under the hood to ensure I use it properly. This post provided a great answer, but my results don't seem to follow the logic described. I'm running the Spark Quick Start guide in Scala on command line. When I do line length addition properly, things come out just fine. Total line length is 1213:

scala> val textFile = sc.textFile("README.md")

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))

scala> val linesWithSparkLengths = linesWithSpark.map(s => s.length)

scala> linesWithSparkLengths.foreach(println)

Result:
14
78
73
42
68
17
62
45
76
64
54
74
84
29
136
77
77
73
70

scala> val totalLWSparkLength = linesWithSparkLengths.reduce((a,b) => a+b)
    totalLWSparkLength: Int = 1213

When I tweak it slightly to use (a-b) instead of (a+b),

scala> val totalLWSparkTest = linesWithSparkLengths.reduce((a,b) => a-b)

I expected -1185, according to the logic in this post:

List(14,78,73,42,68,17,62,45,76,64,54,74,84,29,136,77,77,73,70).reduce( (x,y) => x - y )
  Step 1 : op( 14, 78 ) will be the first evaluation. 
     x is 14 and y is 78. Result of x - y = -64.
  Step 2:  op( op( 14, 78 ), 73 )
     x is op(14,78) = -64 and y = 73. Result of x - y = -137
  Step 3:  op( op( op( 14, 78 ), 73 ), 42) 
     x is op( op( 14, 78 ), 73 ) = -137 and y is 42. Result is -179.
  ...
  Step 18:  op( (... ), 73), 70) will be the final evaluation.
     x is -1115 and y is 70. Result of x - y is -1185.

However, something strange happens:

scala> val totalLWSparkTest = linesWithSparkLengths.reduce((a,b) => a-b)
totalLWSparkTest: Int = 151

When I run it again...

scala> val totalLWSparkTest = linesWithSparkLengths.reduce((a,b) => a-b)
totalLWSparkTest: Int = -151

Can anyone tell me why the result is 151 (or -151) instead of -1185?

like image 930
Bo Rankin Avatar asked Mar 13 '23 09:03

Bo Rankin


1 Answers

It happens because subtraction is neither associative nor commutative. Lets start with associativity:

(- (- (- 14 78) 73) 42) 
(- (- -64 73) 42)
(- -137 42) 
-179

is not the same as

(- (- 14 78) (- 73 42))
(- -64 (- 73 42))
(- -64 31)
-95

Now its time for commutativity:

(- (- (- 14 78) 73) 42) ;; From the previous example

is not the same as

(- (- (- 42 73) 78) 14)
(- (- -31 78) 14)
(- -109 14)
-123

Spark first applies reduce on individual partitions and then merges partial results in arbitrary order. If function you use doesn't meet one or both criteria final results can be non-deterministic.

like image 125
zero323 Avatar answered Mar 19 '23 18:03

zero323