In the official Spark documentation, there is an example of an accumulator being used in a foreach call directly on an RDD:
scala> val accum = sc.accumulator(0)
accum: spark.Accumulator[Int] = 0
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s
scala> accum.value
res2: Int = 10
I implemented my own accumulator:
val myCounter = sc.accumulator(0)
val myRDD = sc.textFile(inputpath) // :spark.RDD[String]
myRDD.flatMap(line => foo(line)) // line 69
def foo(line: String) = {
myCounter += 1 // line 82 throwing NullPointerException
// compute something on the input
}
println(myCounter.value)
In a local setting this works just fine. However, if I run this job on a Spark standalone cluster with several machines, the workers throw a
13/07/22 21:56:09 ERROR executor.Executor: Exception in task ID 247
java.lang.NullPointerException
at MyClass$.foo(MyClass.scala:82)
at MyClass$$anonfun$2.apply(MyClass.scala:67)
at MyClass$$anonfun$2.apply(MyClass.scala:67)
at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:400)
at spark.PairRDDFunctions.writeToFile$1(PairRDDFunctions.scala:630)
at spark.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:640)
at spark.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:640)
at spark.scheduler.ResultTask.run(ResultTask.scala:77)
at spark.executor.Executor$TaskRunner.run(Executor.scala:98)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
at the line which increments the accumulator myCounter.
My question is: Can accumulators only be used in "top-level" anonymous functions which are applied directly to RDDs and not in nested functions? If yes, why does my call succeed locally and fail on a cluster?
edit: increased verbosity of exception.
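For reference, one pattern that sidesteps referencing a field of the enclosing object from inside a task is to keep the accumulator in a local val and pass it to the helper explicitly, so the closure serializes the accumulator itself. A minimal sketch along those lines (the object name, helper signature, and input handling are illustrative, not the actual code from the question):

import org.apache.spark.{Accumulator, SparkConf, SparkContext}

object CounterSketch {
  // The accumulator is handed in as a parameter instead of being read
  // from a field of the enclosing object.
  def foo(line: String, counter: Accumulator[Int]): Seq[String] = {
    counter += 1              // increments are shipped back to the driver
    line.split("\\s+").toSeq  // stand-in for the real computation
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CounterSketch"))
    val myCounter = sc.accumulator(0)  // local val, captured by the closure
    val myRDD = sc.textFile(args(0))
    myRDD.flatMap(line => foo(line, myCounter)).count() // an action forces evaluation
    println(myCounter.value)
    sc.stop()
  }
}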
Accumulators are like global variables in a Spark application. In practice, accumulators are used as counters to keep track of something at the application level. They serve a very similar purpose to counters in MapReduce.
Accumulators are variables that are only “added” to through an associative operation and can therefore, be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types.
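As a sketch of what adding support for a new type can look like with the Spark 1.x AccumulatorParam API used throughout this question (the vector accumulator below is purely illustrative):

import org.apache.spark.{AccumulatorParam, SparkConf, SparkContext}

// Hypothetical accumulator over Vector[Double] (Spark 1.x API).
object VectorAccumulatorParam extends AccumulatorParam[Vector[Double]] {
  // The neutral element used when merging partial results.
  def zero(initial: Vector[Double]): Vector[Double] =
    Vector.fill(initial.length)(0.0)

  // How two partial values are combined; must be associative.
  def addInPlace(v1: Vector[Double], v2: Vector[Double]): Vector[Double] =
    v1.zip(v2).map { case (a, b) => a + b }
}

object VectorAccumulatorExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("VectorAccumulatorExample"))
    val vecAccum = sc.accumulator(Vector(0.0, 0.0))(VectorAccumulatorParam)
    sc.parallelize(Seq(Vector(1.0, 2.0), Vector(3.0, 4.0))).foreach(v => vecAccum += v)
    println(vecAccum.value) // Vector(4.0, 6.0)
    sc.stop()
  }
}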
On shared variables more broadly, the documentation says:
Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
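Here is a minimal sketch that uses both kinds of shared variables together, again with the Spark 1.x API; the lookup table and input values are made up purely for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SharedVariablesSketch"))

    // Broadcast variable: a read-only value cached on every node.
    val countryNames = sc.broadcast(Map("DE" -> "Germany", "FR" -> "France"))

    // Accumulator: only "added" to on the executors, read on the driver.
    val unknownCodes = sc.accumulator(0, "unknown country codes")

    val resolved = sc.parallelize(Seq("DE", "FR", "XX")).map { code =>
      countryNames.value.getOrElse(code, { unknownCodes += 1; "unknown" })
    }
    resolved.collect().foreach(println)

    // Accumulator values are only dependable after an action has run.
    println("unknown codes: " + unknownCodes.value)
    sc.stop()
  }
}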
In my case too, the accumulator was null in the closure when I used extends App to create a Spark application, as shown below:
import org.apache.spark.{SparkConf, SparkContext}

object AccTest extends App {
  val conf = new SparkConf().setAppName("AccTest").setMaster("yarn-client")
  val sc = new SparkContext(conf)
  sc.setLogLevel("ERROR")

  val accum = sc.accumulator(0, "My Accumulator")
  // accum resolves to null inside this closure on the executors
  sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

  println("count:" + accum.value)
  sc.stop()
}
I replaced extends App with a main() method and it worked on a YARN cluster with HDP 2.4:
import org.apache.spark.{SparkConf, SparkContext}

object AccTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AccTest").setMaster("yarn-client")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")

    val accum = sc.accumulator(0, "My Accumulator")
    sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

    println("count:" + accum.value)
    sc.stop()
  }
}