
How to update spark configuration after resizing worker nodes in Cloud Dataproc

I have a Dataproc Spark cluster. Initially, the master and the 2 worker nodes were of type n1-standard-4 (4 vCPUs, 15 GB memory); I then resized all of them to n1-highmem-8 (8 vCPUs, 52 GB memory) via the web console.

I noticed that the two worker nodes are not being fully used. In particular, there are only 2 executors on the first worker node and 1 executor on the second worker node, with

spark.executor.cores 2
spark.executor.memory 4655m

in /usr/lib/spark/conf/spark-defaults.conf. I thought that with spark.dynamicAllocation.enabled set to true, the number of executors would be increased automatically.

Also, the information on the Dataproc page of the web console doesn't get updated automatically either. It seems that Dataproc still thinks that all nodes are n1-standard-4.

My questions are:

  1. Why are there more executors on the first worker node than on the second?
  2. Why aren't more executors added to each node?
  3. Ideally, I want the whole cluster to be fully utilized. If the Spark configuration needs to be updated, how do I do that?
asked Aug 03 '16 by zyxue

1 Answer

As you've found, a cluster's configuration is set when the cluster is first created and does not adjust to manual resizing.

To answer your questions:

  1. The Spark ApplicationMaster takes a container in YARN on a worker node, usually on the first worker if only a single Spark application is running.
  2. When a cluster is started, Dataproc attempts to fit two YARN containers per machine.
  3. The YARN NodeManager configuration on each machine determines how much of the machine's resources should be dedicated to YARN. This can be changed on each VM under /etc/hadoop/conf/yarn-site.xml, followed by a sudo service hadoop-yarn-nodemanager restart. Once the machines advertise more resources to the ResourceManager, Spark can start more containers. After adding more resources to YARN, you may also want to increase the size of the containers Spark requests by modifying spark.executor.memory and spark.executor.cores (see the sketch just after this list).
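
As a rough sketch, the per-VM change might look like the following. The property names are the standard YARN NodeManager settings; the values are only illustrative for an n1-highmem-8 worker and should be tuned to leave headroom for the OS and daemons:

<!-- /etc/hadoop/conf/yarn-site.xml on each worker; illustrative values -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>49152</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>

$ sudo service hadoop-yarn-nodemanager restart

Then, in /usr/lib/spark/conf/spark-defaults.conf, you could ask for larger executors (again, illustrative values, not a prescription):

spark.executor.cores 4
spark.executor.memory 10g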

Instead of resizing cluster nodes and manually editing configuration files afterwards, consider starting a new cluster with the new machine sizes and copying any data from your old cluster to the new cluster. In general, the simplest way to move data is to use Hadoop's built-in distcp utility. An example usage would be something along the lines of:

$ hadoop distcp hdfs:///some_directory hdfs://other-cluster-m:8020/

Or, if you can use Cloud Storage:

$ hadoop distcp hdfs:///some_directory gs://<your_bucket>/some_directory

Alternatively, consider always storing data in Cloud Storage and treating each cluster as an ephemeral resource that can be torn down and recreated at any time. In general, any time you would save data to HDFS, you can also save it as:

gs://<your_bucket>/path/to/file
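
For instance, since the Cloud Storage connector ships with Dataproc, the usual Hadoop filesystem commands also work against gs:// paths (the bucket and paths below are just placeholders):

$ hadoop fs -ls gs://<your_bucket>/some_directory
$ hadoop fs -cp hdfs:///some_directory gs://<your_bucket>/some_directory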

Saving to GCS has the nice benefit of allowing you to delete your cluster (and the data in HDFS on its persistent disks) when the cluster is not in use.

answered Sep 21 '22 by Angus Davis