An RDD goes through a series of transformations with user-defined functions or methods defined in an object, and these functions are passed to the executors in the form of tasks. These tasks are instances of a Scala class defined in spark-core.
I assume the user-defined functions/methods are wrapped in a task object and passed to the executors.
How do the executors know which method, wrapped in the task class, needs to be executed?
How exactly does serialization help here?
How does the SparkContext read the user code and convert it into tasks?
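For example (a hypothetical sketch of the kind of user code I mean; the names are mine, not from any real application):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Example {
  // A user-defined method referenced inside a transformation.
  def addOne(x: Int): Int = x + 1

  def main(args: Array[String]): Unit = {
    // The master is normally supplied by spark-submit; fall back to local for a quick run.
    val conf = new SparkConf().setAppName("example").setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 10)
    // How does addOne reach the executors, and how do they know to invoke it?
    println(rdd.map(addOne).collect().mkString(","))

    sc.stop()
  }
}
```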
Spark's function passing is fundamentally based on Java serialization. In Java you can pass arbitrary code to another machine over the network; it can be a simple case class or any class with any behavior.
There is only one requirement: the serialized class needs to be on the classpath of the target JVM.
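For illustration, here is a minimal sketch in plain Scala (no Spark needed) of serializing a function to bytes and reading it back, which is essentially what Spark's Java-serialization-based closure shipping relies on:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object SerializationSketch {
  def main(args: Array[String]): Unit = {
    // A Scala function literal compiles to a scala.Function1 instance that is serializable.
    val f: Int => Int = x => x * 2

    // Serialize it to bytes, roughly what happens when a closure is shipped to an executor.
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(f)
    out.close()

    // Deserialization only works because the function's class is on this JVM's classpath.
    // On a worker, that classpath comes from the application jar shipped at submit time.
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    val g = in.readObject().asInstanceOf[Int => Int]
    println(g(21)) // 42
  }
}
```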
On startup, when you use spark-submit, it distributes your jar file to all Spark worker nodes. That allows the driver to pass serialized functions to the worker nodes, and because the serialized class is on the classpath, any function sent from the driver can be deserialized.
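For reference, a typical spark-submit invocation looks something like this (the application class, master URL, and jar name below are placeholders):

```
spark-submit \
  --class com.example.MyApp \
  --master spark://master-host:7077 \
  --deploy-mode client \
  my-app.jar
```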
Spark doesn't define any specific Task class per RDD transformation. If you use it from Scala, then for map operations you are sending serialized versions of scala Function1. If you use aggregate, reduceByKey, etc., it can be a Function2. In any case, it's nothing Spark-specific, just a plain Scala (Java) class.
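A small local sketch (names are illustrative) showing that what actually gets serialized and shipped are ordinary scala.Function1 and scala.Function2 instances:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FunctionTypes {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fn-types").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 4)

    // map takes a plain scala.Function1[Int, Int]
    val double: Int => Int = _ * 2
    // reduce takes a plain scala.Function2[Int, Int, Int]
    val sum: (Int, Int) => Int = _ + _

    println(rdd.map(double).reduce(sum)) // 2 + 4 + 6 + 8 = 20
    sc.stop()
  }
}
```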