I know that accumulator variables are 'write only' from the point of view of tasks while they are executing on worker nodes. I was doing some testing on this and I realized that I am able to print the accumulator value in the task.
Here I am initializing the accumulator in the driver:-
scala> val accum = sc.accumulator(123)
accum: org.apache.spark.Accumulator[Int] = 123
Then I go on to define a function 'foo':-
scala> def foo(pair:(String,String)) = { println(accum); pair }
foo: (pair: (String, String))(String, String)
In this function I am simply printing the accumulator and then returning the same pair that was received.
Now I have an RDD called myrdd with the following type:-
scala> myrdd
res13: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[9] at map at <console>:21
And I am now calling the map transformation on this RDD:-
myrdd.map(foo).collect
The 'collect' action is applied to force evaluation. What actually happens during this execution is that a zero (0) is printed for every element of the RDD. Since this RDD has 4 elements, it prints 0 four times. Since the action 'collect' is there, it also prints all the elements at the end, but that's not really the focus here. So I have two questions:-
After some experimentation I found that if I change the function definition to access the actual value property of the accumulator object (accum.value), and then trigger the RDD action as described above, it does indeed throw the exception:-
scala> def foo(pair:(String,String)) = { println(accum.value); pair }
The exception thrown during the RDD evaluation:-
Can't read accumulator value in the task
So what I was doing earlier was printing the accumulator object itself. But the question still remains: why did it print 0? At the driver level, if I issue the same command that I used in the function definition, I do indeed get the value 123:-
scala> println(accum)
123
I didn't have to say println(accum.value) for it to work. So why does it print 0 only when I issue this command inside the function that the task uses?
Each of these accumulator classes has several methods; among these, the add() method is called from tasks running on the cluster. Tasks can't read values from the accumulator; only the driver program can read an accumulator's value, using the value() method.
Worker tasks on a Spark cluster can add values to an Accumulator with the += operator, but only the driver program is allowed to access its value, using value. Updates from the workers get propagated automatically to the driver program.
Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums. This guide shows each of these features in each of Spark's supported languages.
An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then add to it using the add method or the += operator (in Scala and Python). However, they cannot read its value.
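To make that contract concrete, here is a minimal sketch of the intended pattern, assuming the same Spark 1.x spark-shell session as in the question (sc is the SparkContext; the lineCount accumulator and the sample data are made up for illustration): tasks only add to the accumulator, and only the driver reads it via value.
val lineCount = sc.accumulator(0)                  // created on the driver
val data = sc.parallelize(Seq("a", "b", "c", "d"))
data.foreach(x => lineCount += 1)                  // tasks may only add to it
println(lineCount.value)                           // read on the driver only: prints 4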
Why is it printing 0 as the value of the accumulator, when we had initialized it as 123 in the driver?
Because worker nodes never see the initial value. The only thing that is passed to the workers is the zero, as defined in AccumulatorParam. For Accumulator[Int] it is simply 0. If you first update an accumulator you'll see the updated local value:
val acc = sc.accumulator(123)
val rdd = sc.parallelize(List(1, 2, 3))
rdd.foreach(i => {acc += i; println(acc)})
It is even clearer when you use a single partition:
rdd.repartition(1).foreach(i => {acc += i; println(acc)})
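Putting the two sides next to each other makes the point clearer. The sketch below is illustrative, assuming a fresh accumulator in a local-mode spark-shell (so the task's println output shows up in the console) and that this foreach is the only action run against acc:
val acc = sc.accumulator(123)
val rdd = sc.parallelize(List(1, 2, 3))
rdd.repartition(1).foreach { i =>
  acc += i
  // The task-local copy starts from the zero (0), not from 123, so the
  // running totals printed here climb from 0 up to 6 and never include 123.
  println(acc)
}
// Back on the driver, the merged updates are added to the initial value:
println(acc.value)                                 // 123 + (1 + 2 + 3) = 129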
Why was the exception not thrown (...)?
Because the exception is thrown when you access the value method, and toString is not using it at all. Instead it uses the private value_ variable, the same one that is returned by value if the !deserialized check passes.
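For reference, the relevant logic looks roughly like this. It is a simplified paraphrase built from the description above, not verbatim Spark source; only value, toString, value_ and the deserialized flag come from the actual Spark 1.x Accumulable class, the rest is illustrative:
class SimplifiedAccumulable[T](initialValue: T) extends Serializable {
  // Reset to the AccumulatorParam zero (not initialValue) when the object is
  // deserialized inside a task, at which point deserialized becomes true.
  @transient private var value_ : T = initialValue
  private var deserialized = false

  def value: T =
    if (!deserialized) value_                      // on the driver: returns value_
    else throw new UnsupportedOperationException("Can't read accumulator value in the task")

  // println(accum) goes through toString, which reads value_ directly and
  // therefore never hits the !deserialized check.
  override def toString: String = String.valueOf(value_)
}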