In Spark's client mode, does the driver need network access to the remote executors?

When using Spark in client mode (e.g. yarn-client), does the local machine that runs the driver communicate directly with the cluster worker nodes that run the remote executors?

If yes, does it mean the machine that runs the driver needs network access to the worker nodes? In other words, the master node requests resources from the cluster and returns the IP addresses/ports of the worker nodes to the driver, so the driver can initiate the communication with the worker nodes?

If not, how does the client mode actually work?

If yes, does it mean that client mode won't work if the cluster is configured in a way that the worker nodes are not visible outside the cluster, and one will have to use cluster mode instead?

Thanks!

asked Sep 29 '15 by Lost In Translation


People also ask

What type of process are the driver and the executors?

The driver and each of the executors run in their own Java processes. The driver is the process where the main method runs. First it converts the user program into tasks and after that it schedules the tasks on the executors.
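A minimal Scala sketch of that split, assuming the usual RDD API (the app name and dataset are placeholders): the main method runs in the driver JVM, while the filter closure and the count action run as tasks on the executor JVMs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // main() runs in the driver JVM
    val conf = new SparkConf().setAppName("driver-sketch")
    val sc = new SparkContext(conf)

    // The filter closure is serialized and shipped to the executor JVMs as tasks
    val evens = sc.parallelize(1 to 1000000).filter(_ % 2 == 0)

    // The action triggers the job: the driver schedules the tasks on the
    // executors, and only the aggregated result comes back to the driver.
    println(s"Even count: ${evens.count()}")

    sc.stop()
  }
}
```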

How do you prevent Spark executors from getting lost when using yarn client mode?

Try increasing the executor memory. One of the most common reasons for executor failures is insufficient memory: when an executor consumes more memory than it was assigned, YARN kills it.
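A hedged sketch of that tuning in Scala; the memory and core values are illustrative only, and on newer Spark releases the overhead key is spark.executor.memoryOverhead rather than the YARN-specific name used here.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only -- tune them to your cluster. In yarn-client mode
// these executor settings can be set in SparkConf before the SparkContext is
// created, or passed to spark-submit via --executor-memory and --conf.
val conf = new SparkConf()
  .setAppName("memory-tuning-sketch")
  .set("spark.executor.memory", "4g")                 // heap per executor
  .set("spark.yarn.executor.memoryOverhead", "1024")  // off-heap headroom in MB
  .set("spark.executor.cores", "2")

val sc = new SparkContext(conf)
```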

What is the difference between cluster mode and client mode in Spark?

In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

How do you choose the driver and executor memory in Spark?

Determine the memory resources available to the Spark application by multiplying the cluster RAM size by the YARN utilization percentage; this might leave, say, 5 GB of RAM for the driver and 50 GB of RAM for the worker nodes. Then subtract 1 core per worker node to determine the executor core instances.
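As a rough illustration of that arithmetic (the cluster size and utilization figure below are assumptions, not values from the question):

```scala
// Back-of-the-envelope sizing; the cluster size and YARN utilization figure
// are assumptions for illustration only.
val clusterRamGb    = 64.0   // total RAM across worker nodes (assumed)
val yarnUtilization = 0.85   // fraction of RAM YARN may allocate (assumed)
val usableRamGb     = clusterRamGb * yarnUtilization  // ~54 GB for the application

val driverMemoryGb  = 5.0                             // reserved for the driver
val executorRamGb   = usableRamGb - driverMemoryGb    // what is left for executors

val coresPerNode         = 8
val executorCoresPerNode = coresPerNode - 1           // leave 1 core per node for OS/daemons

println(f"usable: $usableRamGb%.1f GB, executors: $executorRamGb%.1f GB, executor cores per node: $executorCoresPerNode")
```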


1 Answer

The driver connects to the Spark master and requests a context; the Spark master then passes the driver's details to the Spark workers, so they can communicate with the driver and get instructions on what to do.

This means that the driver node must be reachable on the network from the workers, and its IP must be one that is visible to them (e.g. if the driver is behind NAT while the workers are in a different network, it won't work, and you'll see errors on the workers saying they fail to connect to the driver).
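As a sketch of the driver-side settings involved (the host name and port numbers are placeholders for your own environment, not part of the answer), the driver can advertise a reachable address and pin its listening ports so the workers can connect back through a firewall:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("client-mode-networking")
  // Address the executors use to connect back to the driver; it must resolve
  // and be routable from the worker nodes (a NAT'd private address won't do).
  .set("spark.driver.host", "driver-host.example.com")
  // Pin the driver's listening ports so a firewall rule can allow them
  // (by default Spark picks random ephemeral ports).
  .set("spark.driver.port", "40000")
  .set("spark.blockManager.port", "40001")

val sc = new SparkContext(conf)
```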

answered Nov 15 '22 by Romi Kuntsman