 

Spark Driver in Apache Spark

Tags:

apache-spark

I already have a cluster of 3 machines (ubuntu1, ubuntu2, ubuntu3, VMs in VirtualBox) running Hadoop 1.0.0. I installed Spark on each of these machines. ubuntu1 is my master node and the other nodes work as slaves. My question is: what exactly is a Spark driver? Should we set an IP and port for the driver via spark.driver.host, and where will it be executed and located (on the master or on a slave)?

asked Jul 08 '14 by user3789843

People also ask

What is the Apache Spark driver?

The Spark driver is used to orchestrate the whole Spark cluster: it manages the work that is distributed across the cluster as well as which machines are available throughout the cluster's lifetime.

What is the role of the driver in Spark architecture?

The driver is the process that runs the user code that creates RDDs, performs transformations and actions, and creates the SparkContext. When the Spark Shell is launched, this signifies that we have created a driver program. When the driver terminates, the application is finished.

How is the Spark driver created?

Once the physical plan is generated, Spark allocates the tasks to the executors. Each task runs on an executor and, upon completion, returns its result to the driver. Finally, when all tasks are completed, the main() method running in the driver exits, i.e. the main() method invokes sparkContext.stop().

Where does the Spark driver run?

In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.


2 Answers

The Spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master.

In practical terms, the driver is the program that creates the SparkContext, connecting to a given Spark master. In the case of a local (standalone) cluster, as in your case, the master URL is spark://<host>:<port>.

Its location is independent of the master/slaves. You can co-locate it with the master or run it from another node. The only requirement is that it be on a network addressable from the Spark workers.

This is what the configuration of your driver looks like:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("master_url") // this is where the master is specified
  .setAppName("SparkExamplesMinimal")
  .set("spark.local.ip", "xx.xx.xx.xx")    // helps when multiple network interfaces are present; the driver must be in the same network as the master and slaves
  .set("spark.driver.host", "xx.xx.xx.xx") // same as above; this duality might disappear in a future version

val sc = new SparkContext(conf)
// etc...
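
The question also asks about the port. By default the driver listens on a random free port; if you need a fixed one (for example, to open a firewall rule between the nodes), the spark.driver.port property can be set on the same SparkConf before the SparkContext is created. A minimal sketch continuing the conf above, with a made-up port number:

conf.set("spark.driver.port", "51000") // hypothetical fixed port; by default Spark picks a random free one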

To explain the different roles a bit more:

  • The driver prepares the context and declares the operations on the data using RDD transformations and actions (see the sketch after this list).
  • The driver submits the serialized RDD graph to the master. The master creates tasks out of it and submits them to the workers for execution. It coordinates the different job stages.
  • The workers are where the tasks are actually executed. They should have the resources and the network connectivity required to execute the operations requested on the RDDs.
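
To make the division of labour concrete, here is a minimal, hypothetical driver program (the object name, master URL and data are assumptions, not taken from the answer above): the RDD declarations run on the driver, while the per-partition work triggered by the reduce action is executed as tasks on the workers.

import org.apache.spark.{SparkConf, SparkContext}

object RolesExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://ubuntu1:7077") // assumed standalone master URL
      .setAppName("RolesExample")
    val sc = new SparkContext(conf)

    val numbers = sc.parallelize(1 to 1000, 4)   // RDD declared on the driver, split into 4 partitions
    val squares = numbers.map(n => n.toLong * n) // transformation: only recorded here, not yet executed
    val total = squares.reduce(_ + _)            // action: triggers tasks that run on the workers

    println(s"sum of squares = $total") // the result is sent back to the driver
    sc.stop()
  }
}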
answered Oct 14 '22 by maasg


Your question is related to deploying Spark on YARN; see "Running Spark on YARN": http://spark.apache.org/docs/latest/running-on-yarn.html

Assume you start with a spark-submit --master yarn command:

  1. The command will ask the YARN ResourceManager (RM) to start an ApplicationMaster (AM) process on one of your cluster machines (those with a YARN NodeManager installed).
  2. Once the AM has started, it will call your driver program's main method. So the driver is actually where you define your SparkContext, your RDDs, and your jobs. The driver contains the entry main method, which starts the Spark computation (a minimal sketch of such a driver follows this list).
  3. The SparkContext will prepare the RPC endpoint for the executors to talk back to, and a lot of other things (memory store, disk block manager, Jetty server, ...).
  4. The AM will request containers from the RM to run your Spark executors, with the driver RPC URL (something like spark://CoarseGrainedScheduler@ip:37444) specified in the executors' start command.
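
For illustration, a minimal, hypothetical driver program of the kind whose main method the AM invokes in yarn cluster mode (the object name, input path and word-count logic are assumptions for the sketch). Note that no master is hard-coded: spark-submit --master yarn supplies it.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD operations (reduceByKey) on older Spark versions

object YarnDriverExample {
  def main(args: Array[String]): Unit = {
    // No setMaster here: spark-submit provides the master and deploy-mode settings.
    val conf = new SparkConf().setAppName("YarnDriverExample")
    val sc = new SparkContext(conf) // sets up the RPC endpoint, block manager, web UI, ...

    val counts = sc.textFile("hdfs:///tmp/input.txt") // assumed HDFS path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println) // results are returned from the executors to the driver
    sc.stop() // main() ends and the application finishes
  }
}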

In the YARN cluster mode diagram (figure omitted here), the yellow box "Spark Context" is the driver.

answered Oct 14 '22 by fjolt