 

getting number of visible nodes in PySpark


I'm running some operations in PySpark, and recently increased the number of nodes in my configuration (which is on Amazon EMR). However, even though I tripled the number of nodes (from 4 to 12), performance seems not to have changed. As such, I'd like to see if the new nodes are visible to Spark.

I'm checking the following value:

>>> sc.defaultParallelism
2

But I think this is telling me the total number of tasks distributed to each node, not the total number of nodes that Spark can see.

How do I go about seeing the number of nodes that PySpark is using in my cluster?

asked Feb 27 '15 by Bryan

People also ask

How do you find the number of executors in PySpark?

For example, with 150 total cores and 5 cores per executor: number of available executors = total cores / cores per executor = 150 / 5 = 30.
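As a minimal sketch of that arithmetic (the 150 cores and 5 cores per executor are assumed example values, not figures from this question):

total_cores = 150            # assumed total cores available to the cluster
cores_per_executor = 5       # assumed value of spark.executor.cores
num_executors = total_cores // cores_per_executor
print(num_executors)         # 30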

How do you determine the number of nodes in a Spark cluster?

Multiply the number of cluster cores by the YARN utilization percentage to get the cores available to Spark; this yields, for example, 3 driver cores and 30 worker node cores. Then determine the memory resources available for the Spark application by multiplying the cluster RAM size by the YARN utilization percentage. A sketch of this sizing arithmetic follows below.
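A minimal sketch of that sizing arithmetic, assuming a hypothetical cluster and YARN utilization factor (all numbers below are illustrative, not from the question):

cluster_cores = 48            # hypothetical total cores in the cluster
cluster_ram_gb = 192          # hypothetical total RAM in the cluster
yarn_utilization = 0.70       # hypothetical share of resources YARN hands to Spark

usable_cores = int(cluster_cores * yarn_utilization)   # cores available to Spark
usable_ram_gb = cluster_ram_gb * yarn_utilization      # memory available to Spark
print(usable_cores, usable_ram_gb)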

How do you determine the number of executors in Spark?

Go to the Spark History Server user interface and open the incomplete application. Locate the application ID you found above and open it. Go to the Executors tab and you will see the list of executors.
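If you prefer a programmatic check over clicking through the UI, Spark's monitoring REST API exposes the same executor list. A minimal sketch, assuming the driver UI is reachable on localhost:4040 and the application ID below is a placeholder:

import json
from urllib.request import urlopen

app_id = "app-20150227120000-0001"   # placeholder: substitute your real application ID
url = "http://localhost:4040/api/v1/applications/%s/executors" % app_id
executors = json.load(urlopen(url))  # one entry per active executor (plus the driver)
print(len(executors))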

What is an RDD in PySpark?

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.
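A quick illustration of those properties, assuming a live SparkContext named sc:

rdd = sc.parallelize(range(100), 8)      # an immutable, partitioned collection
print(rdd.getNumPartitions())            # 8 partitions
print(rdd.map(lambda x: x * 2).sum())    # operated on in parallel -> 9900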


1 Answer

In PySpark you can still call the Scala getExecutorMemoryStatus API through PySpark's py4j bridge:

sc._jsc.sc().getExecutorMemoryStatus().size() 
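One caveat worth verifying for your deploy mode (stated here as an assumption): getExecutorMemoryStatus reports block managers, which normally includes the driver, so the number of worker executors is typically this value minus one:

num_workers = sc._jsc.sc().getExecutorMemoryStatus().size() - 1  # assumes the driver is counted in the map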
answered Sep 19 '22 by Nic