According to Passing Functions to Spark, the documentation claims:
accessing fields of the outer object will reference the whole object; To avoid this issue ...
I am wondering what the risk of the following code is:
import org.apache.spark.rdd.RDD

class MyClass {
  val field = "Hello"
  // `field` is really `this.field`, so the closure captures `this`
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(x => field + x) }
}
Would referencing the whole of this do any harm?
This will cause Spark to serialize your whole object and send it to each of the executors. If some of the fields of your object contain large amounts of data, this might be slow. It might also cause a "Task not serializable" exception if your object is not serializable.
Here's an example of someone with this problem: Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects
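To make the difference concrete, here is a minimal sketch that needs no Spark cluster. It assumes that plain Java serialization is a fair stand-in for what Spark does when it ships a closure to the executors. The names capturesThis, capturesLocal, ClosureCheck and isSerializable are made up for this illustration and are not part of any Spark API. The second method uses the usual workaround of copying the field into a local variable, so the closure captures only a String instead of the whole object.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for the class in the question; note it is NOT Serializable.
class MyClass {
  val field = "Hello"

  // Reads `field` through `this`, so the closure captures the whole MyClass instance.
  def capturesThis: String => String = x => field + x

  // Copies the field into a local val first, so the closure captures only a String.
  def capturesLocal: String => String = {
    val localField = field
    x => localField + x
  }
}

object ClosureCheck {
  // Returns true if Java serialization of `obj` succeeds,
  // false if something it references is not serializable.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val c = new MyClass
    println(isSerializable(c.capturesThis))  // closure drags in `this`
    println(isSerializable(c.capturesLocal)) // closure holds only a String
  }
}
```

On Scala 2.12 or later I would expect the first check to fail and the second to succeed, because only the first closure has to serialize MyClass itself. Inside doStuff the same idea would look like copying field into a local val before calling rdd.map. Making MyClass extend Serializable also avoids the exception, but the whole object is still shipped with every task.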