I am practicing Apache Spark and ran into the following problem.
val accum = sc.accumulator(0, "My Accumulator")
println(accum) // prints: 0
sc.parallelize(Array(1, 2, 3, 4, 5)).foreach(x => accum += x)
// sc.parallelize(Array(1, 2, 3, 4, 5)).foreach(x => accum = accum + x)
println(accum.value) // prints: 15
This line of code
sc.parallelize(Array(1, 2, 3, 4, 5)).foreach(x => accum += x)
works fine, but the commented-out line below it does not. The difference is between
x => accum += x
and
x => accum = accum + x
Why does the second one not work?
An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then add to it using the add method or the += operator (in Scala and Python). However, they cannot read its value; only the driver program can read it, using its value method.
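For instance, here is a minimal sketch of that contract, using the same (since-deprecated) SparkContext.accumulator API as in the question; the accumulator name "events" is just an illustrative choice:

val events = sc.accumulator(0, "events")
sc.parallelize(1 to 100).foreach { _ =>
  events += 1   // tasks may add with += ...
  events.add(1) // ... or with the add method
  // events.value here would throw: tasks cannot read the value
}
println(events.value) // only the driver may read it; prints 200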
Accumulators are shared variables provided by Spark that are write-only from the point of view of tasks: they are only "added" to through an associative and commutative operation, which is what lets Spark support them efficiently in parallel. They can be used to implement counters (as in MapReduce) or sums.
In PySpark, an Accumulator is created with the accumulator() method of the SparkContext class, i.e. sparkContext.accumulator(), and accumulators over custom types can be defined by subclassing AccumulatorParam.
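Although that paragraph describes PySpark, the pre-2.0 Scala API mirrors it with an AccumulatorParam trait of its own. Here is a sketch of a custom accumulator over (count, sum) pairs, assuming that older API; the object name PairParam is hypothetical:

import org.apache.spark.AccumulatorParam

object PairParam extends AccumulatorParam[(Int, Int)] {
  def zero(initial: (Int, Int)): (Int, Int) = (0, 0)
  def addInPlace(a: (Int, Int), b: (Int, Int)): (Int, Int) =
    (a._1 + b._1, a._2 + b._2)
}

val pair = sc.accumulator((0, 0))(PairParam)
sc.parallelize(Array(1, 2, 3)).foreach(x => pair += ((1, x)))
println(pair.value) // (3, 6): three elements, summing to 6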
To answer the question "When are accumulators truly reliable?": when the update happens inside an action. Per the documentation, for accumulator updates performed inside actions, Spark guarantees that each task's update is applied only once, even if the task is restarted; updates made inside transformations may be applied more than once if a task or stage is re-executed.
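A sketch of that contrast, assuming the same API as above: the first update runs inside an action and is counted exactly once, while the second runs inside a transformation and may be double-counted on retry:

val safe = sc.accumulator(0)
sc.parallelize(1 to 10).foreach(_ => safe += 1) // inside an action: exactly once

val unsafe = sc.accumulator(0)
val tagged = sc.parallelize(1 to 10).map { x => unsafe += 1; x } // inside a transformation
tagged.count() // a retried or speculatively re-run task may add again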
There are three reasons why it doesn't work:

1. accum is a val, so it cannot be reassigned.
2. The Accumulable class, which is the base class of Accumulator, provides only a += method, not +.
3. A + method could modify accum in place, but that would be rather confusing.

Believe it or not, an Accumulator in Apache Spark works like a write-only global variable. In our imperative thinking we see no difference between x += 1 and x = x + 1, but there is a slight one: in Apache Spark the second operation would require reading the value, while the first one would not. Or, put more simply (as zero said in his explanation), the + method just isn't implemented for that class. You can read about how this works on p. 41 of the slides extracted from the Introduction to Big Data with Apache Spark course.
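Putting it together, here is a sketch of the forms that do and do not compile, based on the question's own snippet:

val accum = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4, 5)).foreach { x =>
  accum += x      // fine: Accumulable defines +=
  // accum.add(x) // the equivalent explicit method call
  // accum = accum + x
  // ^ fails twice over: accum is a val, so it cannot be reassigned,
  //   and Accumulator defines no + that would produce a new value
}
println(accum.value) // 15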