Number of partitions in RDD and performance in Spark

Tags: performance, apache-spark, rdd, pyspark

In PySpark, I can create an RDD from a list and decide how many partitions it should have:

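    from pyspark import SparkContext

    sc = SparkContext()
    # Distribute the numbers 0..9 across 4 partitions (use xrange on Python 2)
    rdd = sc.parallelize(range(0, 10), 4)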

How does the number of partitions I choose for my RDD influence performance? And how does this depend on the number of cores my machine has?

782

asked Mar 04 '16 16:03

mar tin

1 Answer

The main performance effects come from specifying too few partitions or far too many.

Too few partitions: You will not utilize all of the cores available in the cluster.

Too many partitions: There will be excessive overhead in managing many small tasks.

Of the two, the first has far more impact on performance. Scheduling too many small tasks currently has a relatively small impact for partition counts below 1000. With partition counts on the order of tens of thousands, Spark gets very slow.
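To make the trade-off concrete, here is a rough back-of-the-envelope model in plain Python. It is only a sketch, not how Spark actually schedules work: the function name `estimate_stage_time`, the even split of work across partitions, and the fixed `per_task_overhead_s` of 0.05 seconds are all assumptions made up for illustration.

```python
import math


def estimate_stage_time(total_work_s, num_partitions, num_cores,
                        per_task_overhead_s=0.05):
    """Toy estimate of the wall-clock time of one stage, in seconds.

    Assumptions (hypothetical, for illustration only):
    - the work is split evenly, one task per partition;
    - each task occupies one core and pays a fixed overhead;
    - tasks run in "waves" of at most num_cores tasks at a time.
    """
    task_time = total_work_s / num_partitions + per_task_overhead_s
    waves = math.ceil(num_partitions / num_cores)
    return waves * task_time


# 80 seconds of total work on a machine with 8 cores
for partitions in (1, 8, 24, 1000, 50000):
    seconds = estimate_stage_time(80.0, partitions, num_cores=8)
    print(f"{partitions:>6} partitions -> about {seconds:.2f} s")
```

In this model a single partition leaves seven of the eight cores idle, a partition count near the core count is close to the best case, and the cost only climbs steeply once the per-task overhead starts to dominate. As a starting point, the Spark tuning guide suggests roughly 2 to 3 tasks per CPU core in the cluster.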

173

answered Oct 12 '22 02:10

WestCoastProjects
