 

How spark driver serializes the task that is sent to executors?

Tags:

apache-spark

An RDD goes through a series of transformations that use user-defined functions or methods defined in objects. These functions are passed to the executors in the form of tasks, which are instances of a Scala class defined in spark-core.

I assume the user-defined functions / methods are wrapped in a task object and passed to the executors. For example, consider the sketch below.
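A minimal sketch of the situation I mean (the object and method names are just illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Example {
  // user-defined method that ends up running on the executors
  def multiplyByTwo(x: Int): Int = x * 2

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("example").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 10)

    // How does multiplyByTwo reach the executors as part of a task?
    val result = rdd.map(multiplyByTwo).collect()

    println(result.toList)
    sc.stop()
  }
}
```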

  1. How do the executors know which method, wrapped in the task class, needs to be executed?

  2. How exactly is serialization helpful here?

  3. How does the SparkContext read the user code and convert it into tasks?

Knight71 asked Jul 12 '15 09:07


1 Answer

Spark function passing is fundamentally based on Java serialization. In Java you can pass arbitrary code to another machine over the network; it can be a simple case class or any class with any behavior.

There is only one requirement: the serialized class needs to be on the classpath of the target JVM.
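As a rough illustration of that requirement, here is a plain JVM sketch (no Spark involved; all names are made up) of serializing a function object to bytes and deserializing it again, which only works because the deserializing side has the class on its classpath:

```scala
import java.io._

object SerializationSketch {
  def main(args: Array[String]): Unit = {
    // A function value is just an object; mix in Serializable so it can be written out.
    val f: Int => Int = new (Int => Int) with Serializable {
      def apply(x: Int): Int = x * 2
    }

    // Serialize the function object to bytes (conceptually what the driver does).
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(f)
    oos.close()

    // Deserialize it back (conceptually what an executor does); this succeeds
    // only if the anonymous class is on the classpath of the deserializing JVM.
    val ois = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
    val g = ois.readObject().asInstanceOf[Int => Int]
    println(g(21)) // 42
  }
}
```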

When you use spark-submit, it distributes your jar file to all Spark worker nodes at startup. This allows the driver to pass serialized functions to the worker nodes, and because the serialized class is on the classpath, any function sent from the driver can be deserialized.

Spark doesn't define any specific Task class for RDD transformations. If you use it from Scala, for map operations you are sending serialized versions of scala Function1.

If you use aggregate / reduce by key etc., it can be Function2. In any case, it's nothing Spark-specific; it's just a plain Scala (Java) class.
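For instance, a sketch along these lines (assuming a running SparkContext named `sc`; the variable names are just illustrative) shows which function types end up being serialized and shipped:

```scala
import org.apache.spark.SparkContext

def example(sc: SparkContext): Unit = {
  val nums  = sc.parallelize(1 to 5)
  val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

  // map takes a Function1[Int, Int]; this lambda is the object that gets serialized.
  val doubled = nums.map(x => x * 2)

  // reduceByKey takes a Function2[Int, Int, Int].
  val sums = pairs.reduceByKey((a, b) => a + b)

  println(doubled.collect().toList) // List(2, 4, 6, 8, 10)
  println(sums.collect().toMap)     // Map(a -> 3, b -> 3)
}
```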

Eugene Zhulenev answered Oct 19 '22 17:10