I have a DataProc Spark cluster. Initially, the master and the 2 worker nodes were of type n1-standard-4 (4 vCPUs, 15 GB memory); I then resized all of them to n1-highmem-8 (8 vCPUs, 52 GB memory) via the web console.
I noticed that the two worker nodes are not being fully used. In particular, there are only 2 executors on the first worker node and 1 executor on the second worker node, with

spark.executor.cores 2
spark.executor.memory 4655m

in /usr/lib/spark/conf/spark-defaults.conf. I thought that with spark.dynamicAllocation.enabled true, the number of executors would be increased automatically.
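For reference, the values above can be checked on the master node with something like the following (the grep just reads the config file; yarn node -list reports the number of running containers per NodeManager, which is how I counted the executors):

$ grep executor /usr/lib/spark/conf/spark-defaults.conf
$ yarn node -list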
Also, the information on the DataProc page of the web console doesn't get updated automatically either. It seems that DataProc still thinks that all nodes are n1-standard-4.
My questions are
As you've found, a cluster's configuration is set when the cluster is first created and does not automatically adjust to manual resizing.
To answer your questions:
Instead of resizing cluster nodes and manually editing configuration files afterwards, consider starting a new cluster with the new machine sizes and copying any data from your old cluster to the new cluster. In general, the simplest way to move data is to use Hadoop's built-in distcp utility. An example usage would be something along the lines of:
$ hadoop distcp hdfs:///some_directory hdfs://other-cluster-m:8020/
Or if you can use Cloud Storage:
$ hadoop distcp hdfs:///some_directory gs://<your_bucket>/some_directory
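For the cluster-creation step itself, the new machine sizes can be specified up front. A sketch with the gcloud CLI (the cluster name is just an example, and other flags such as zone or region may be required depending on your setup):

$ gcloud dataproc clusters create new-cluster \
    --master-machine-type n1-highmem-8 \
    --worker-machine-type n1-highmem-8 \
    --num-workers 2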
Alternatively, consider always storing data in Cloud Storage and treating each cluster as an ephemeral resource that can be torn down and recreated at any time. In general, any time you would save data to HDFS, you can also save it as:
gs://<your_bucket>/path/to/file
Saving to GCS has the nice benefit of allowing you to delete your cluster (and data in HDFS, on persistent disks) when not in use.
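For example, since the Cloud Storage connector is installed on Dataproc clusters, the usual Hadoop tools accept gs:// paths directly (the bucket and file names below are placeholders):

$ hadoop fs -put local_results.csv gs://<your_bucket>/path/to/file
$ hadoop fs -ls gs://<your_bucket>/path/to/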