I already have a cluster of 3 machines (ubuntu1, ubuntu2, ubuntu3; VirtualBox VMs) running Hadoop 1.0.0. I installed Spark on each of these machines. ubuntu1 is my master node and the other nodes work as slaves. My questions are: what exactly is a Spark driver? Should we set an IP and port for the driver via spark.driver.host? And where is the driver executed and located (master or slave)?
The Spark driver orchestrates the whole Spark application: it manages the work that is distributed across the cluster and keeps track of which machines are available throughout the application's lifetime.
The driver is the process that runs the user code: it creates the SparkContext, creates RDDs, and performs transformations and actions. When the Spark shell is launched, a driver program has already been created for you. When the driver terminates, the application is finished.
Once the physical plan is generated, Spark allocates the tasks to the executors. Each task runs on an executor and, upon completion, returns its result to the driver. Finally, when all tasks are completed, the main() method running in the driver exits, i.e. main() invokes sparkContext.stop().
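As a concrete illustration, here is a minimal sketch of such a driver program (the object name, the master URL spark://ubuntu1:7077, and the numbers are illustrative assumptions, not taken from your setup):

    import org.apache.spark.{SparkConf, SparkContext}

    object MinimalDriver {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setMaster("spark://ubuntu1:7077") // assumed standalone master URL
          .setAppName("MinimalDriver")
        val sc = new SparkContext(conf)      // creating the SparkContext makes this process the driver

        val rdd     = sc.parallelize(1 to 1000) // RDD defined in driver code
        val doubled = rdd.map(_ * 2)            // transformation: only records the lineage, no tasks yet
        val sum     = doubled.reduce(_ + _)     // action: tasks run on the executors, results come back to the driver
        println(s"sum = $sum")

        sc.stop()                               // main() ends by stopping the SparkContext; the application finishes
      }
    }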
In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
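For example, the deploy mode is chosen with spark-submit's --deploy-mode flag (the class name and jar below are hypothetical placeholders):

    # cluster mode: the driver runs inside the YARN application master; the client can disconnect
    spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar

    # client mode: the driver runs in the local spark-submit process on the client machine
    spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar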
The Spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master.
In practical terms, the driver is the program that creates the SparkContext and connects to a given Spark master. In the case of a local standalone cluster, as in your case, the master_url = spark://<host>:<port>.
Its location is independent of the master/slaves. You could co-locate it with the master or run it from another node. The only requirement is that it must be on a network addressable from the Spark workers.
This is what the configuration of your driver looks like:
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("master_url")                 // this is where the master is specified
      .setAppName("SparkExamplesMinimal")
      .set("spark.local.ip", "xx.xx.xx.xx")    // helps when multiple network interfaces are present; the driver must be in the same network as the master and slaves
      .set("spark.driver.host", "xx.xx.xx.xx") // same as above; this duality might disappear in a future version

    val sc = new SparkContext(conf)
    // etc...
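If you prefer not to hard-code the address, the same property can be passed at submit time (the IP, class name, and jar are placeholders):

    spark-submit --conf spark.driver.host=xx.xx.xx.xx --class SparkExamplesMinimal spark-examples.jar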
To explain the different roles a bit more:
- The driver declares the operations on the data (RDD transformations and actions), builds the execution plan, and coordinates the job stages.
- The master allocates cluster resources (executors on the workers) to the application.
- The workers host the executors, which run the tasks and return their results to the driver.
Your question is related to deploying Spark on YARN; see "Running Spark on YARN": http://spark.apache.org/docs/latest/running-on-yarn.html
Assume you start from a spark-submit --master yarn command.
[Diagram of Spark running on YARN: the yellow box labeled "Spark context" is the Driver.]
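Such a command could look like the following sketch (class name, jar, and resource sizes are placeholder assumptions):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --driver-memory 1g \
      --executor-memory 2g \
      --num-executors 3 \
      --class com.example.MyApp \
      my-app.jar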