How does Spark handle objects?

To test the serialization exception in Spark, I wrote a task in two ways.
First way:

package examples
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object dd {
  def main(args: Array[String]):Unit = {
    val sparkConf = new SparkConf
    val sc = new SparkContext(sparkConf)

    val data = List(1,2,3,4,5)
    val rdd = sc.makeRDD(data)
    val result = rdd.map(elem => {
      funcs.func_1(elem)
    })        
    println(result.count())
  }
}

object funcs{
  def func_1(i:Int): Int = {
    i + 1
  }
}

This way, Spark works fine.
However, when I change it to the following way, it does not work and throws a NotSerializableException.
Second way:

package examples
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object dd {
  def main(args: Array[String]):Unit = {
    val sparkConf = new SparkConf
    val sc = new SparkContext(sparkConf)

    val data = List(1,2,3,4,5)
    val rdd = sc.makeRDD(data)

    val handler = funcs
    val result = rdd.map(elem => {
      handler.func_1(elem)
    })

    println(result.count())

  }
}

object funcs{
  def func_1(i:Int): Int = {
    i + 1
  }
}

I know the reason I get the "Task not serializable" error is that I am trying to send the non-serializable object funcs from the driver node to the worker nodes in the second example. For the second example, if I make object funcs extend Serializable, the error goes away.
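For reference, this is the change I mean, i.e. just marking the singleton as Serializable so it can be shipped inside the closure:

object funcs extends Serializable {
  def func_1(i: Int): Int = {
    i + 1
  }
}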

But in my view, because funcs is an object rather than a class, it is a singleton that should be serialized and shipped from the driver to the workers rather than instantiated on a worker node itself. In that scenario, although the way object funcs is used differs, I would guess the non-serializable object funcs is shipped from the driver node to the worker nodes in both of these two examples.

My question is: why does the first example run successfully, while the second one fails with the "Task not serializable" exception?

asked Nov 14 '16 by Frankie

People also ask

How does Spark execute a job?

Spark execution is agnostic to the cluster manager. You can plug in any of the supported cluster managers, and the execution behaviour does not change based on which one you are using. Spark relies on the cluster manager to launch the executors.

How does Spark process data?

Spark executes much faster by caching data in memory across multiple parallel operations, whereas MapReduce involves more reading from and writing to disk. Spark runs multi-threaded tasks inside JVM processes, whereas MapReduce runs as heavier-weight JVM processes.

How much data can Spark handle?

How large a cluster can Spark scale to? Many organizations run Spark on clusters of thousands of nodes; the largest cluster we know of has 8000 nodes. In terms of data size, Spark has been shown to work well up to petabytes.

How does Apache spark work internally?

Spark translates the RDD transformations into a DAG (Directed Acyclic Graph) and starts the execution. At a high level, when an action is called on an RDD, Spark creates the DAG and submits it to the DAG scheduler. The DAG scheduler divides the operators into stages of tasks.


2 Answers

When you run code in an RDD closure (map, filter, etc...), everything necessary to execute that code will be packaged up, serialized, and sent to the executors to be run. Any objects that are referenced (or whose fields are referenced) will be serialized in this task, and this is where you'll sometimes get a NotSerializableException.

Your use case is a little more complicated, though, and involves the Scala compiler. Typically, calling a function on a Scala object is equivalent to calling a Java static method. That object never really exists -- it's basically like writing the code inline. However, if you assign the object to a variable, then you're actually creating a reference to that object in memory; the object behaves more like a class instance and can have serialization issues.

scala> object A { 
  def foo() { 
    println("bar baz")
  }
}
defined module A

scala> A.foo()  // static method
bar baz

scala> val a = A  // now we're actually assigning a memory location
a: A.type = A$@7e0babb1

scala> a.foo()  // dereferences a before calling foo
bar baz
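
If you want to reproduce the same effect outside Spark and the REPL, here is a minimal sketch (not code from the question; A is the object above, and the isSerializable helper and SerializationCheck name are illustrative). It assumes plain Java serialization, which is what Spark's default closure serializer uses: a function that calls A.foo() statically serializes fine, while a function that captures a reference to A fails because the module does not extend Serializable.

import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object A {
  def foo(): Unit = println("bar baz")
}

object SerializationCheck {
  // Attempt to Java-serialize a value, the same way a task closure would be shipped.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    // Calls A.foo() statically; the function value captures no reference to A.
    val direct: Int => Unit = _ => A.foo()

    // Captures the local `a`, so serializing this function drags in the A module instance.
    val a = A
    val captured: Int => Unit = _ => a.foo()

    println(isSerializable(direct))   // expected: true
    println(isSerializable(captured)) // expected: false -- A does not extend Serializable
  }
}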
answered Oct 27 '22 by Tim


In order for Spark to distribute a given operation, the function used in the operation needs to be serialized. Before serialization, these functions pass through a complex process appropriately called "ClosureCleaner".

The intention is to "cut off" closures from their context in order to reduce the size of the object graph that needs to be serialized and to reduce the risk of serialization issues in the process. In other words, it ensures that only the code needed to execute the function is serialized and sent for deserialization and execution "at the other side".

During that process, the closure is also checked for serializability, to proactively detect serialization issues at runtime (see SparkContext#clean).

That code is dense and complex so it's hard to find the right code path leading to this case.

Intuitively, what's happening is that when the ClosureCleaner finds:

val result = rdd.map{elem => 
  funcs.func_1(elem)
} 

It evaluates the inner members of the closure to be from an object that can be recreated and there are no further references, so the cleaned closure only contains {elem => funcs.func_1(elem)} which can be serialized by the JavaSerializer.

Instead, when the closure cleaner evaluates:

val handler = funcs
val result = rdd.map(elem => {
  handler.func_1(elem)
})

It finds that the closure has a reference to $outer (handler), hence it inspects the outer scope and adds the handler variable instance to the cleaned closure. We could imagine the resulting cleaned closure to be something of this shape (this is for illustrative purposes only):

{elem => 
  val handler = funcs
  handler.func_1(elem)
} 

When the closure is tested for serialization, it fails to serialize. Per JVM serialization rules, an object is serializable if recursively all its members are serializable. In this case handler references a non-serializable object and the check fails.
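
As a side note, here is a sketch of a workaround that follows from this analysis, if you want to keep an intermediate handler binding: bind the function value rather than the object, so the cleaned closure captures only a function (which Scala compiles as serializable, something Spark relies on anyway) and no longer references the funcs module.

// Bind the function, not the object: the closure now captures a serializable
// function value instead of a reference to the non-serializable `funcs` module.
val handler: Int => Int = funcs.func_1 _
val result = rdd.map(elem => handler(elem))

println(result.count())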

answered Oct 26 '22 by maasg