
Differentiate driver code and worker code in Apache Spark

In an Apache Spark program, how do we know which part of the code will execute in the driver program and which part will execute on the worker nodes?

asked Oct 26 '15 by Gopinathan K M


People also ask

What is the driver code in Spark?

The driver is a Java process. This is the process where the main() method of our Scala, Java, or Python program runs. It executes the user code and creates a SparkSession or SparkContext, and the SparkSession is responsible for creating DataFrames, Datasets, and RDDs, executing SQL, performing transformations and actions, etc.
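As a minimal sketch of that idea in Scala (the object name, app name, and local master setting are illustrative assumptions, not from the answer):

    import org.apache.spark.sql.SparkSession

    // main() below runs inside the driver JVM process.
    object DriverExample {
      def main(args: Array[String]): Unit = {
        // Creating the SparkSession (and its underlying SparkContext)
        // happens on the driver.
        val spark = SparkSession.builder()
          .appName("driver-example")  // hypothetical application name
          .master("local[*]")         // local master, just for testing
          .getOrCreate()

        // DataFrames, Datasets, and RDDs are *defined* here on the
        // driver, although their data lives on the executors.
        val df = spark.range(0, 100)
        df.show()

        spark.stop()
      }
    }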

What is the difference between driver and executor in Spark?

The executors are responsible for actually executing the work that the driver assigns them. This means each executor is responsible for only two things: executing the code assigned to it by the driver, and reporting the state of the computation on that executor back to the driver node.

What is master and driver in Spark?

Master is per cluster, and Driver is per application. For standalone/yarn clusters, Spark currently supports two deploy modes. In client mode, the driver is launched in the same process as the client that submits the application.

What is the role of driver in Spark architecture?

The driver is the process that runs the user code, creates RDDs, performs transformations and actions, and creates the SparkContext. When the Spark shell is launched, this signifies that we have created a driver program. On the termination of the driver, the application is finished.


2 Answers

It is actually pretty simple. Everything that happens inside the closure created by a transformation happens on a worker. This means that anything passed to map(...), filter(...), mapPartitions(...), groupBy*(...), or aggregateBy*(...) is executed on the workers. That includes reading data from persistent storage or remote sources.
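For example (a small Scala sketch; the variable names and data are made up, and it assumes an existing SparkSession named spark, as in the sketch above):

    val rdd = spark.sparkContext.parallelize(1 to 1000)

    // Defining the transformation happens on the driver; nothing
    // executes yet. When a job runs, the closure x => x * 2 is
    // serialized, shipped to the executors, and applied to each
    // element there.
    val doubled = rdd.map(x => x * 2)

    // Likewise, the predicate passed to filter runs on the workers.
    val evens = doubled.filter(_ % 4 == 0)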

Actions like count, reduce(...), and fold(...) are usually executed on both the driver and the workers. The heavy lifting is performed in parallel by the workers, and some final steps, like combining the outputs received from the workers, are performed sequentially on the driver.
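A sketch of that split for reduce (again assuming the hypothetical SparkSession named spark; the partition count is arbitrary):

    val nums = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)

    // reduce is an action: each of the 8 partitions is summed in
    // parallel on the executors; the 8 partial sums are then combined
    // sequentially on the driver to produce the final value.
    val total = nums.reduce(_ + _)   // 500500
    println(total)                   // println here runs on the driver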

Everything else, like triggering an action or defining a transformation, happens on the driver. In particular, that means every operation which requires access to the SparkContext. In PySpark it also means communication with the Py4J gateway.
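One practical consequence, sketched below (an illustration under the same assumptions, not part of the original answer): because the SparkContext only exists on the driver, referencing it from inside a closure fails at runtime.

    val sc = spark.sparkContext
    val rdd = sc.parallelize(1 to 10)

    // Fine: actions are triggered from the driver, and collect()
    // brings the results back to it.
    val local = rdd.collect()

    // NOT fine: the closure below would need the SparkContext on a
    // worker. Spark rejects this at runtime (typically with an error
    // saying the SparkContext can only be used on the driver, or a
    // serialization failure).
    // rdd.map(x => sc.parallelize(Seq(x)).count())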

answered Sep 20 '22 by zero323


All the closures passed as arguments to methods of JavaRDD/JavaPairRDD (and similar classes), as well as some methods of these classes themselves, will be executed on the Spark worker nodes. Everything else is driver code.

answered Sep 21 '22 by Jack