What are spark.local.ip, spark.driver.host, spark.driver.bindAddress and spark.driver.hostname?

Tags:

apache-spark

What is the difference between, and the purpose of, each of the following?

  • spark.local.ip
  • spark.driver.host
  • spark.driver.bindAddress
  • spark.driver.hostname

How can I fix a specific machine as the Driver in a Spark standalone cluster?

asked Apr 29 '17 by xyz_scala

1 Answer

Short Version

The ApplicationMaster connects to the Spark Driver at the address given by spark.driver.host.

The Spark Driver binds to spark.driver.bindAddress on the client machine.

By examples

1. Example of port binding

.config('spark.driver.port','50243')

then netstat -ano on Windows shows:

TCP    172.18.1.194:50243     0.0.0.0:0              LISTENING       15332
TCP    172.18.1.194:50243     172.18.7.122:54451     ESTABLISHED     15332
TCP    172.18.1.194:50243     172.18.7.124:37412     ESTABLISHED     15332
TCP    172.18.1.194:50243     172.18.7.142:41887     ESTABLISHED     15332
TCP    [::]:4040              [::]:0                 LISTENING       15332

The cluster nodes 172.18.7.1xx are in the same network as my development machine 172.18.1.194, since my netmask is 255.255.248.0.
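
As a sanity check, here is a quick calculation with Python's standard ipaddress module (plain Python, not Spark-specific; the addresses are the ones from the netstat output above):

import ipaddress

# 172.18.1.194 with netmask 255.255.248.0 is the /21 network 172.18.0.0/21,
# which spans 172.18.0.0 - 172.18.7.255 and therefore contains the cluster nodes
net = ipaddress.ip_network('172.18.1.194/255.255.248.0', strict=False)
print(net)                                          # 172.18.0.0/21
print(ipaddress.ip_address('172.18.7.122') in net)  # True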

2. Example of specifying the IP the ApplicationMaster should connect to on the Driver

.config('spark.driver.host','192.168.132.1')

then netstat -ano shows:

TCP    192.168.132.1:58555    0.0.0.0:0              LISTENING       9480
TCP    192.168.132.1:58641    0.0.0.0:0              LISTENING       9480
TCP    [::]:4040              [::]:0                 LISTENING       9480

However, the ApplicationMaster cannot connect and reports the error:

Caused by: java.net.NoRouteToHostException: No route to host

because this IP belongs to a VM bridge adapter on my development machine and is not reachable from the cluster.

3. Example of binding to a separate local address

.config('spark.driver.host','172.18.1.194')
.config('spark.driver.bindAddress','192.168.132.1')

then netstat -ano shows:

TCP    172.18.1.194:63937     172.18.7.101:8032      ESTABLISHED     17412
TCP    172.18.1.194:63940     172.18.7.102:9000      ESTABLISHED     17412
TCP    172.18.1.194:63952     172.18.7.121:50010     ESTABLISHED     17412
TCP    192.168.132.1:63923    0.0.0.0:0              LISTENING       17412
TCP    [::]:4040              [::]:0                 LISTENING       17412
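
Putting example 3 together, here is a minimal end-to-end sketch (the app name and master URL are hypothetical; the two addresses are the ones from the netstat output above):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('bind-address-demo')                         # hypothetical name
         .master('yarn')                                       # client mode by default
         .config('spark.driver.host', '172.18.1.194')          # address the ApplicationMaster connects back to
         .config('spark.driver.bindAddress', '192.168.132.1')  # local interface the Driver listens on
         .config('spark.driver.port', '50243')                 # fixed port, e.g. for firewall rules
         .getOrCreate())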

Detailed Version

Before explaining in detail: there are only these three related conf variables:

  • spark.driver.host
  • spark.driver.port
  • spark.driver.bindAddress

There are NO conf variables named spark.driver.hostname or spark.local.ip. But there IS an environment variable called SPARK_LOCAL_IP.
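
Since SPARK_LOCAL_IP is an environment variable rather than a conf key, it has to be in place before the Driver's JVM starts. A hedged sketch (setting it from Python only works if done before SparkSession/SparkContext creation; exporting it in the shell before spark-submit is the more common approach):

import os
os.environ['SPARK_LOCAL_IP'] = '172.18.1.194'  # must be set before the JVM launches

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('local-ip-demo').getOrCreate()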

Before explaining the variables, we first have to understand the application submission process.

Main Roles of computers:

  • development machine
  • master node (YARN / Spark Master)
  • worker node

There is an ApplicationMaster for each application, which takes care of resource requests to the cluster and status monitoring of jobs (stages).

The ApplicationMaster is in the cluster, always.

Where the Spark Driver runs:

  • development machine: client mode
  • within the cluster: cluster mode, same place as the ApplicationMaster

Let's say we are talking about client mode.

The Spark application can be submitted from a development machine, which acts as the client machine of the application, as well as a client machine of the cluster.

The Spark application can alternatively be submitted from a node within the cluster (a master node, a worker node, or just a specific machine with no resource-manager role).

The client machine might not be in the same subnet as the cluster, and this is one case these variables try to deal with. Think of your own internet connection: it is usually not possible for your laptop to be reached from anywhere around the globe the way google.com can.

At the beginning of the application submission process, spark-submit on the client side uploads the necessary files to the Spark master or YARN and negotiates the resource request. In this step the client connects to the cluster, and the cluster address is the destination address the client tries to reach.

Then the ApplicationMaster starts on the allocated resource.

The resource allocated for the ApplicationMaster is by default chosen at random and cannot be controlled by these variables; it is decided by the cluster's scheduler, if you're curious about this.

Then the ApplicationMaster tries to connect BACK to the Spark Driver. This is where these conf variables take effect.
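
When the connect-back fails, it can help to inspect which address the Driver actually advertised. A small sketch, assuming a running PySpark session named spark:

conf = spark.sparkContext.getConf()
print(conf.get('spark.driver.host'))                       # address advertised to the ApplicationMaster
print(conf.get('spark.driver.port'))                       # port the Driver listens on
print(conf.get('spark.driver.bindAddress', '(not set)'))   # local bind address, if configured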

answered Dec 11 '22 by cdarlint