I am running Spark 1.6.2 on Google Cloud Dataproc (so Dataproc image version 1.0). My cluster consists of a few n1-standard-8 workers, and I am running one executor per core (spark.executor.cores=1).
I see that my overall CPU utilization never gets above 50%, even though each worker is running the correct number of executors (I'm leaving one core on each worker for the OS, etc.).
I'm wondering whether it's possible to run more executors on each worker to utilize the cluster more fully. If so, which settings do I need to specify?
The lscpu output on the worker machines looks like this:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU @ 2.50GHz
Stepping: 4
CPU MHz: 2500.000
BogoMIPS: 5000.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-7
Thanks for any suggestions!
By default, YARN schedules containers (Spark executors, in this case) based only on memory, not on the number of cores they request. Dataproc sets executor memory so that there are two executors per node.
spark.executor.cores is essentially ignored in the context of YARN scheduling, but it is used to decide how many tasks each executor runs in parallel. If you lower spark.executor.cores without also lowering executor memory, you are actually reducing parallelism: with two executors per node and spark.executor.cores=1, an 8-vCPU node runs only two tasks at a time.
You should instead leave executor memory alone and bump up spark.executor.cores. On an n1-standard-4, you should be able to raise spark.executor.cores from 2 to 4 with no problem.
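If you're submitting with gcloud, a per-job override might look something like the following sketch (the cluster name, bucket, jar, and main class are placeholders):

# Hypothetical job submission: raise spark.executor.cores to 4 while
# leaving executor memory at the Dataproc default.
gcloud dataproc jobs submit spark \
    --cluster my-cluster \
    --class com.example.MyJob \
    --jars gs://my-bucket/my-job.jar \
    --properties spark.executor.cores=4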
If you try to set spark.executor.cores higher than the number of YARN vcores on a node, Spark will complain. You can fix this by setting yarn.nodemanager.resource.cpu-vcores=<large-number>; <large-number> then becomes the new upper bound.
Depending on how I/O-bound your job is, you can easily double or quadruple spark.executor.cores, if not more. Writing files to GCS tends to be very I/O-bound.
Note that while you can specify Spark properties when submitting a job (as in the sketch above), you'll only be able to specify that YARN property when creating the cluster:
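For example, something like this sketch (the cluster name, machine type, and vcore count are placeholders; the yarn: prefix routes the property into yarn-site.xml):

# Hypothetical cluster creation: raise the YARN vcore ceiling so that
# spark.executor.cores can go above the machine's actual vCPU count.
gcloud dataproc clusters create my-cluster \
    --worker-machine-type n1-standard-8 \
    --properties 'yarn:yarn.nodemanager.resource.cpu-vcores=32'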