I'm running some operations in PySpark, and recently increased the number of nodes in my configuration (which is on Amazon EMR). However, even though I tripled the number of nodes (from 4 to 12), performance seems not to have changed. As such, I'd like to see if the new nodes are visible to Spark.
I'm checking the following value:

>>> sc.defaultParallelism
2
But I think this is telling me the total number of tasks distributed to each node, not the total number of nodes that Spark can see.
How do I go about seeing the number of nodes that PySpark is using in my cluster?
One way to estimate this from the cluster configuration: number of available executors = (total cores / num-cores-per-executor) = 150 / 5 = 30.
To find the core resources actually available to the Spark application, multiply the number of cluster cores by the YARN utilization percentage; in this example that provides 3 driver and 30 worker-node cores. Determine the memory resources available in the same way, by multiplying the cluster RAM size by the YARN utilization percentage.
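If it helps to see that arithmetic in one place, here is a minimal Python sketch of the same estimate; the figures for total cores, cores per executor, YARN utilization, and cluster RAM are illustrative assumptions, not values read from a real cluster:

# Back-of-the-envelope sizing; every number below is an assumption.
TOTAL_CORES = 150          # assumed total vCPUs across the cluster
CORES_PER_EXECUTOR = 5     # assumed spark.executor.cores
YARN_UTILIZATION = 0.9     # assumed fraction of resources YARN can allocate
CLUSTER_RAM_GB = 600       # assumed total RAM across the cluster

available_executors = TOTAL_CORES // CORES_PER_EXECUTOR   # 150 / 5 = 30
usable_cores = int(TOTAL_CORES * YARN_UTILIZATION)
usable_ram_gb = CLUSTER_RAM_GB * YARN_UTILIZATION

print("available executors:", available_executors)
print("cores usable by YARN:", usable_cores)
print("RAM usable by YARN (GB):", usable_ram_gb)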
Go to the Spark History Server UI and open the incomplete application. Locate the application ID that you found above and open it. Go to the Executors tab, and you will see the list of executors.
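If you would rather script that check than click through the UI, Spark's monitoring REST API exposes the same executor list as JSON. Below is a minimal sketch; the History Server address (port 18080) is an assumption for a typical EMR master node, and sc is assumed to be an existing SparkContext:

import requests

# sc is the SparkContext from the running PySpark session.
app_id = sc.applicationId                      # e.g. "application_16..._0001"
base = "http://localhost:18080/api/v1"         # assumed History Server address
executors = requests.get(f"{base}/applications/{app_id}/executors").json()

# The driver appears in this list with id == "driver"; the rest are executors.
workers = [e for e in executors if e["id"] != "driver"]
print(len(workers), "executors visible to Spark")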
A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark: it represents an immutable, partitioned collection of elements that can be operated on in parallel.
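As a small illustration of that "partitioned collection" idea, and of why defaultParallelism reports partitions rather than nodes, the following sketch assumes sc is an existing SparkContext:

# Build an RDD from a local range, explicitly split into 8 partitions.
rdd = sc.parallelize(range(1000), numSlices=8)

# This reports partitions, which Spark processes in parallel across
# executors; it is not the number of nodes in the cluster.
print(rdd.getNumPartitions())   # 8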
In PySpark you can still call the Scala getExecutorMemoryStatus API using PySpark's Py4J bridge:

sc._jsc.sc().getExecutorMemoryStatus().size()
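One caveat worth adding: the map returned by getExecutorMemoryStatus typically includes the driver's block manager as well, so a common convention is to subtract one to get the worker count. A sketch, assuming sc is an existing SparkContext:

# getExecutorMemoryStatus() returns a map keyed by "host:port" for every
# block manager, which usually includes the driver, hence the minus one.
num_block_managers = sc._jsc.sc().getExecutorMemoryStatus().size()
num_executors = num_block_managers - 1
print("executors (excluding driver):", num_executors)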