Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Passing Functions to Spark: What is the risk of referencing the whole object?

According to Passing Functions to Spark,it claims:

accessing fields of the outer object will reference the whole object; To avoid this issue ...

I am considering that what is the risk of flowing code:

class MyClass {
  val field = "Hello"
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(x => field + x) }
}

references all of this would do any harm ?

like image 235
chenzhongpu Avatar asked Mar 04 '15 01:03

chenzhongpu


Video Answer


1 Answers

This will cause Spark to serialize your whole object and send it to each of the executors. If some of the fields of your object contain big amounts of data, it might be slow. Also it might cause task not serializable exception if your object is not serializable

Here's an example of the guy with this problem: Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects

like image 111
0x0FFF Avatar answered Oct 11 '22 15:10

0x0FFF