In an Apache Spark program, how do we know which part of the code will execute in the driver program and which part will execute on the worker nodes?
The driver is a Java process. It is the process in which the main() method of our Scala, Java, or Python program runs. It executes the user code and creates the SparkSession or SparkContext, and the SparkSession is responsible for creating DataFrames, Datasets, and RDDs, executing SQL, and performing transformations and actions.
The executors are responsible for actually executing the work that the driver assigns to them. This means each executor is responsible for only two things: executing the code assigned to it by the driver, and reporting the state of the computation on that executor back to the driver node.
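For illustration only (the app name, numbers, and logic below are arbitrary, not taken from the original answers), a minimal Scala application makes the split visible: everything in main() runs on the driver, while the bodies of the closures run on the executors.

```scala
import org.apache.spark.sql.SparkSession

object WhereItRuns {
  def main(args: Array[String]): Unit = {          // main() runs in the driver JVM
    val spark = SparkSession.builder()
      .appName("WhereItRuns")
      .getOrCreate()                               // driver: creates the SparkSession/SparkContext
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 100)             // driver: defines the RDD lineage
    val doubled = rdd.map(_ * 2)                   // driver: records the transformation;
                                                   // the closure itself runs on executors
    val total = doubled.reduce(_ + _)              // executors do the per-partition work,
                                                   // the driver combines the partial results
    println(total)                                 // driver: plain local code
    spark.stop()
  }
}
```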
The master is per cluster, and the driver is per application. For standalone/YARN clusters, Spark currently supports two deploy modes. In client mode, the driver is launched in the same process as the client that submits the application. In cluster mode, the driver is launched inside the cluster (on a worker node for standalone, or inside the YARN ApplicationMaster), and the client can disconnect after submitting.
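As a hedged illustration of how the deploy mode is chosen at submission time (the jar name and main class below are placeholders):

```
# Client mode: the driver runs in the spark-submit process on the machine you submit from.
spark-submit --master yarn --deploy-mode client \
  --class com.example.MyApp my-app.jar

# Cluster mode: the driver is launched inside the cluster; the client can disconnect.
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyApp my-app.jar
```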
The driver is the process that runs the user code, creates the SparkContext, creates RDDs, and performs transformations and actions. When the Spark shell is launched, a driver program has been created; when the driver terminates, the application is finished.
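A quick sketch of this in the Scala spark-shell (the RDD below is arbitrary): the shell has already created the driver for you, with spark and sc available.

```scala
// Inside spark-shell the driver is already running:
// spark (SparkSession) and sc (SparkContext) are pre-created.
val rdd = sc.parallelize(1 to 10)   // defined on the driver
rdd.count()                         // triggers work on the executors

// Stopping the context (or exiting the shell) terminates the application.
sc.stop()
```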
It is actually pretty simple. Everything that happens inside the closure created by a transformation happens on a worker. This means that anything passed inside map(...), filter(...), mapPartitions(...), groupBy*(...), or aggregateBy*(...) is executed on the workers. That includes reading data from persistent storage or remote sources.
Actions like count, reduce(...), and fold(...) are usually executed on both the driver and the workers. The heavy lifting is performed in parallel by the workers, and some final steps, such as reducing the outputs received from the workers, are performed sequentially on the driver.
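A small sketch of that split, with an arbitrary dataset and partition count:

```scala
val rdd = sc.parallelize(1 to 1000, numSlices = 8)

// Each of the 8 partitions is summed on an executor (the heavy lifting),
// then the 8 partial sums are sent back and combined on the driver.
val total = rdd.reduce(_ + _)

// count() behaves similarly: executors count their own partitions in parallel,
// and the driver adds up the per-partition counts.
val n = rdd.count()
```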
Everything else, such as triggering an action or defining a transformation, happens on the driver. In particular, this means every operation that requires access to the SparkContext. In PySpark it also means communication with the Py4j gateway.
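A sketch of the driver-only side (the datasets are arbitrary): anything that touches the SparkContext, such as parallelize, collect() or broadcast(), must be called from driver code, never from inside a closure running on an executor.

```scala
val small = sc.parallelize(Seq(1, 2, 3))
val big   = sc.parallelize(1 to 1000)

// This fails at runtime: the closure would need the SparkContext on an executor,
// where none exists (Spark rejects nested RDD operations inside transformations).
// big.map(x => sc.parallelize(Seq(x)).count())   // <- do NOT do this

// Driver-side alternative: collect the small dataset and broadcast it.
val lookup = sc.broadcast(small.collect().toSet)          // collect() and broadcast() run on the driver
val marked = big.map(x => (x, lookup.value.contains(x)))  // this closure runs on executors
marked.count()
```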
All the closures passed as arguments to methods of JavaRDD/JavaPairRDD (and similar classes), as well as some methods of these classes themselves, are executed by the Spark worker nodes. Everything else is driver code.