What's the best way of finding each partition's size for a given RDD? I'm trying to debug a skewed partition issue, and I've tried this:
l = builder.rdd.glom().map(len).collect() # get length of each partition
print('Min Partition Size: ',min(l),'. Max Partition Size: ', max(l),'. Avg Partition Size: ', sum(l)/len(l),'. Total Partitions: ', len(l))
It works fine for small RDDs, but for bigger RDDs it gives an OOM error. My suspicion is that glom() is causing this. But anyway, I just wanted to know if there is a better way to do it?
One partition is created for each block of the file in HDFS (64 MB by default in older Hadoop versions, 128 MB in newer ones). However, when creating an RDD a second argument can be passed that defines the number of partitions to create, as in the sketch below.
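For example (a minimal sketch; the file path is illustrative, and Spark treats the second argument as a minimum number of partitions):
textFile = sc.textFile("hdfs:///path/to/file.txt", 5)  # ask for at least 5 partitions
print(textFile.getNumPartitions())  # check how many partitions were actually created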
For good performance, each partition should be smaller than about 200 MB, and the number of partitions should usually be 1x to 4x the number of cores you have (which also means it is important to create a cluster that matches your data scale).
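As a rough sketch of that sizing rule (assuming a SparkSession named spark and the question's builder DataFrame; 2x the core count is just one point in the 1x to 4x range):
cores = spark.sparkContext.defaultParallelism  # typically the total cores available to the application
builder = builder.repartition(2 * cores)       # aim for roughly 1x-4x the core count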
Similarly, in PySpark (Spark with Python) you can get the current number of partitions by calling getNumPartitions() on the RDD class, so to use it with a DataFrame you first need to access its underlying RDD.
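For example, with a DataFrame named df (the name is just illustrative):
num_parts = df.rdd.getNumPartitions()  # DataFrame -> underlying RDD -> partition count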
Use:
builder.rdd.mapPartitions(lambda it: [sum(1 for _ in it)])
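Unlike glom(), this only brings one integer per partition back to the driver, never the partition contents, so it avoids the OOM. A sketch of reproducing the question's stats this way (using the question's builder DataFrame):
l = builder.rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
print('Min:', min(l), '. Max:', max(l), '. Avg:', sum(l) / len(l), '. Total Partitions:', len(l))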
While the answer by @LostInOverflow works great, I've found another way to find the size as well as the index of each partition, using the code below. Thanks to this awesome post.
Here is the code:
l = test_join.rdd.mapPartitionsWithIndex(lambda x, it: [(x, sum(1 for _ in it))]).collect()
and then you can get the max and min size partitions using this code:
min(l, key=lambda item: item[1])
max(l, key=lambda item: item[1])
Once we know the index of the skewed partition, we can further debug the contents of that partition if needed, as in the sketch below.
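One way to do that (assuming l and test_join from above; take(20) just samples a few rows rather than pulling the whole partition to the driver):
skew_idx = max(l, key=lambda item: item[1])[0]  # index of the largest partition
skewed_rows = (test_join.rdd
    .mapPartitionsWithIndex(lambda x, it: it if x == skew_idx else iter([]))
    .take(20))  # inspect a sample of rows from the skewed partition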