
Differentiate driver code and worker code in Apache Spark

In an Apache Spark program, how do we know which part of the code will execute in the driver program and which part will execute on the worker nodes?

asked Oct 26 '15 by Gopinathan K M


People also ask

What is the driver code in Spark?

The driver is a Java process. This is the process where the main() method of our Scala, Java, or Python program runs. It executes the user code and creates a SparkSession or SparkContext, and the SparkSession is responsible for creating DataFrames, Datasets, and RDDs, executing SQL, performing transformations and actions, etc.
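As a minimal sketch of that idea in Scala (the object name, app name, and local master setting are illustrative assumptions, not from the answer):

    import org.apache.spark.sql.SparkSession

    // main() below runs inside the driver JVM process.
    object DriverExample {
      def main(args: Array[String]): Unit = {
        // Creating the SparkSession (and its underlying SparkContext)
        // happens on the driver.
        val spark = SparkSession.builder()
          .appName("driver-example")  // hypothetical application name
          .master("local[*]")         // local master, just for testing
          .getOrCreate()

        // DataFrames, Datasets, and RDDs are *defined* here on the
        // driver, although their data lives on the executors.
        val df = spark.range(0, 100)
        df.show()

        spark.stop()
      }
    }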

What is the difference between driver and executor in Spark?

The executors are responsible for actually executing the work that the driver assigns them. This means each executor is responsible for only two things: executing the code assigned to it by the driver, and reporting the state of the computation on that executor back to the driver node.

What is master and driver in Spark?

Master is per cluster, and Driver is per application. For standalone/yarn clusters, Spark currently supports two deploy modes. In client mode, the driver is launched in the same process as the client that submits the application.

What is the role of driver in Spark architecture?

The driver is the process that runs the user code, creates RDDs, performs transformations and actions, and creates the SparkContext. When the Spark shell is launched, this signifies that we have created a driver program. On the termination of the driver, the application is finished.


2 Answers

It is actually pretty simple. Everything that happens inside the closure created by a transformation happens on a worker. This means that anything passed to map(...), filter(...), mapPartitions(...), groupBy*(...), or aggregateBy*(...) is executed on the workers. That includes reading data from persistent storage or remote sources.
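For example (a small Scala sketch; the variable names and data are made up, and it assumes an existing SparkSession named spark, as in the sketch above):

    val rdd = spark.sparkContext.parallelize(1 to 1000)

    // Defining the transformation happens on the driver; nothing
    // executes yet. When a job runs, the closure x => x * 2 is
    // serialized, shipped to the executors, and applied to each
    // element there.
    val doubled = rdd.map(x => x * 2)

    // Likewise, the predicate passed to filter runs on the workers.
    val evens = doubled.filter(_ % 4 == 0)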

Actions like count, reduce(...), and fold(...) are usually executed on both the driver and the workers. The heavy lifting is performed in parallel by the workers, and some final steps, like combining the outputs received from the workers, are performed sequentially on the driver.
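A sketch of that split for reduce (again assuming the hypothetical SparkSession named spark; the partition count is arbitrary):

    val nums = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)

    // reduce is an action: each of the 8 partitions is summed in
    // parallel on the executors; the 8 partial sums are then combined
    // sequentially on the driver to produce the final value.
    val total = nums.reduce(_ + _)   // 500500
    println(total)                   // println here runs on the driver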

Everything else, like triggering an action or defining a transformation, happens on the driver. In particular, that means every operation which requires access to the SparkContext. In PySpark it also means communication with the Py4J gateway.
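One practical consequence, sketched below (an illustration under the same assumptions, not part of the original answer): because the SparkContext only exists on the driver, referencing it from inside a closure fails at runtime.

    val sc = spark.sparkContext
    val rdd = sc.parallelize(1 to 10)

    // Fine: actions are triggered from the driver, and collect()
    // brings the results back to it.
    val local = rdd.collect()

    // NOT fine: the closure below would need the SparkContext on a
    // worker. Spark rejects this at runtime (typically with an error
    // saying the SparkContext can only be used on the driver, or a
    // serialization failure).
    // rdd.map(x => sc.parallelize(Seq(x)).count())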

answered Sep 20 '22 by zero323


All the closures passed as arguments to methods of JavaRDD/JavaPairRDD (and similar classes), as well as some methods of these classes themselves, will be executed on the Spark worker nodes. Everything else is driver code.

answered Sep 21 '22 by Jack