I run my Spark application on a YARN cluster. In my code I use the number of cores available to the queue to create partitions on my dataset:
Dataset ds = ...
ds.coalesce(config.getNumberOfCores());
My question: how can I get the number of cores available to the queue programmatically, rather than from configuration?
From basic math (X * Y = 15), we can see that there are four different executor-and-core combinations that get us to 15 Spark cores per node: 1 executor with 15 cores, 3 executors with 5 cores, 5 executors with 3 cores, or 15 executors with 1 core each.
The number of cores can be specified with the --executor-cores flag when invoking spark-submit, spark-shell, and pyspark from the command line, or by setting the spark.executor.cores property in the spark-defaults.conf file or on a SparkConf object.
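For example, a minimal Java sketch of setting this on a SparkConf before building the session (the value 5 and the app name are just illustrative choices, not taken from the question):

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

// Equivalent to passing --executor-cores 5 on the spark-submit command line
SparkConf conf = new SparkConf()
        .setAppName("executor-cores-example")
        .set("spark.executor.cores", "5"); // illustrative value

SparkSession spark = SparkSession.builder()
        .config(conf)
        .getOrCreate();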
According to Databricks, if the driver and executors are of the same node type, this is the way to go:
java.lang.Runtime.getRuntime.availableProcessors * (sc.statusTracker.getExecutorInfos.length -1)
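A minimal sketch of the same calculation in the Java API, assuming an existing JavaSparkContext named jsc (the name is illustrative). The - 1 excludes the driver from the executor list, and availableProcessors() reports the driver node's cores, which is why the driver and executor node types must match:

import org.apache.spark.api.java.JavaSparkContext;

// Cores on the driver node (assumed equal to an executor node's cores)
int coresPerNode = Runtime.getRuntime().availableProcessors();
// getExecutorInfos() includes the driver, so subtract one
int executorCount = jsc.statusTracker().getExecutorInfos().length - 1;
int totalExecutorCores = coresPerNode * executorCount;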
I found this while looking for the answer to pretty much the same question, and it turns out that:
Dataset ds = ...
ds.coalesce(sc.defaultParallelism());
does exactly what the OP was looking for.
For example, my 5-node x 8-core cluster returns 40 for the defaultParallelism.
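Put together, a minimal Java sketch of that approach (the input path and app name are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("coalesce-example").getOrCreate();

// Placeholder source; substitute your own dataset
Dataset<Row> ds = spark.read().parquet("/path/to/data");

// On YARN, defaultParallelism is typically the total number of executor cores,
// e.g. 5 nodes x 8 cores = 40
int parallelism = spark.sparkContext().defaultParallelism();
Dataset<Row> coalesced = ds.coalesce(parallelism);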