What is the difference between these settings, and what is each one used for? How do I fix a machine as the driver in a Spark standalone cluster?
The ApplicationMaster connects to the Spark driver via spark.driver.host.
The Spark driver binds to spark.driver.bindAddress on the client machine.
With only the port fixed:
.config('spark.driver.port', '50243')
netstat -ano on Windows then shows:
TCP 172.18.1.194:50243 0.0.0.0:0 LISTENING 15332
TCP 172.18.1.194:50243 172.18.7.122:54451 ESTABLISHED 15332
TCP 172.18.1.194:50243 172.18.7.124:37412 ESTABLISHED 15332
TCP 172.18.1.194:50243 172.18.7.142:41887 ESTABLISHED 15332
TCP [::]:4040 [::]:0 LISTENING 15332
The cluster nodes (172.18.7.1xx) are on the same network as my development machine (172.18.1.194), since my netmask is 255.255.248.0.
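This can be verified with Python's ipaddress module (a quick check, nothing Spark-specific):

import ipaddress

# 255.255.248.0 is a /21 mask, so the subnet spans 172.18.0.0 - 172.18.7.255.
net = ipaddress.ip_network('172.18.1.194/21', strict=False)
print(net)                                          # 172.18.0.0/21
print(ipaddress.ip_address('172.18.7.122') in net)  # True: same network as the driver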
With
.config('spark.driver.host', '192.168.132.1')
netstat -ano shows:
TCP 192.168.132.1:58555 0.0.0.0:0 LISTENING 9480
TCP 192.168.132.1:58641 0.0.0.0:0 LISTENING 9480
TCP [::]:4040 [::]:0 LISTENING 9480
However, the ApplicationMaster cannot connect, and the reported error is
Caused by: java.net.NoRouteToHostException: No route to host
because this IP is a VM bridge interface on my development machine.
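The same check confirms it: the bridge address is outside the cluster's subnet, so the nodes have no route back to the driver.

import ipaddress

net = ipaddress.ip_network('172.18.1.194/21', strict=False)
# The VM bridge address is not on the cluster's network, hence "No route to host".
print(ipaddress.ip_address('192.168.132.1') in net)  # False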
With
.config('spark.driver.host', '172.18.1.194')
.config('spark.driver.bindAddress', '192.168.132.1')
netstat -ano shows:
TCP 172.18.1.194:63937 172.18.7.101:8032 ESTABLISHED 17412
TCP 172.18.1.194:63940 172.18.7.102:9000 ESTABLISHED 17412
TCP 172.18.1.194:63952 172.18.7.121:50010 ESTABLISHED 17412
TCP 192.168.132.1:63923 0.0.0.0:0 LISTENING 17412
TCP [::]:4040 [::]:0 LISTENING 17412
Before explaining in detail, note that there are only three related conf variables:
spark.driver.host
spark.driver.port
spark.driver.bindAddress
There are NO variables like spark.driver.hostname or spark.local.ip, but there IS an environment variable called SPARK_LOCAL_IP.
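For illustration, a minimal sketch of where SPARK_LOCAL_IP fits in a PySpark session, assuming YARN client mode (the address is a placeholder):

import os
from pyspark.sql import SparkSession

# SPARK_LOCAL_IP is an environment variable, not a conf key, so it is set in the
# environment (before the driver JVM starts) rather than via .config(). When both
# are given, spark.driver.bindAddress takes precedence over it.
os.environ['SPARK_LOCAL_IP'] = '172.18.1.194'  # placeholder address

spark = SparkSession.builder.master('yarn').getOrCreate()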
Before explaining the variables, we first have to understand the application submission process.
Main roles of the machines:
There is one ApplicationMaster per application, which takes care of requesting resources from the cluster and monitoring the status of jobs (stages).
The ApplicationMaster always runs inside the cluster.
Placement of the Spark driver:
Let's say we are talking about client mode.
The Spark application can be submitted from a development machine, which acts as the client machine of the application as well as a client machine of the cluster.
Alternatively, the Spark application can be submitted from a node within the cluster (a master node, a worker node, or just a machine with no resource-manager role).
The client machine might not be on the same subnet as the cluster, and this is one case these variables try to deal with. Think of your internet connection: your laptop usually cannot be reached from anywhere on the globe the way google.com can.
At the beginning of the application submission process, spark-submit on the client side uploads the necessary files to the Spark master or YARN and negotiates resource requests. In this step the client connects to the cluster; the cluster's address is the destination that the client tries to connect to.
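Note that the destination of this first step is the master URL, not any spark.driver.* setting; a minimal sketch assuming YARN:

from pyspark.sql import SparkSession

# Step 1 direction: client -> cluster. The destination is the master URL; for
# 'yarn' it is resolved from yarn-site.xml on the client (the ResourceManager at
# port 8032 in the netstat output above). spark.driver.* plays no role here.
spark = SparkSession.builder.master('yarn').getOrCreate()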
Then the ApplicationMaster starts on the allocated resources.
Where the ApplicationMaster lands is by default effectively random and cannot be controlled by these variables; it is decided by the cluster's scheduler, if you're curious about this.
Then the ApplicationMaster tries to connect BACK to the Spark driver. This is where these conf variables take effect.
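Putting the two directions together, here is a sketch mirroring the third experiment from the question (placeholder addresses, YARN client mode assumed):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master('yarn')  # client deploy mode is the default for an interactive driver
    # Step 2 direction: ApplicationMaster -> driver. This is the address the
    # driver ADVERTISES, so it must be reachable from inside the cluster.
    .config('spark.driver.host', '172.18.1.194')
    # The local interface the driver's listening sockets actually bind to.
    .config('spark.driver.bindAddress', '192.168.132.1')
    .getOrCreate()
)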