I am using Spark SQL (actually hiveContext.sql()) with group by queries, and I am running into OOM issues. I am thinking of increasing the value of spark.sql.shuffle.partitions from the default of 200 to 1000, but it is not helping. My understanding is that the partitions share the shuffle load, so the more partitions there are, the less data each one has to hold. I am new to Spark. I am using Spark 1.4.0 and I have around 1 TB of uncompressed data to process with hiveContext.sql() group by queries.
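For reference, a minimal sketch of how such a setting can be changed on a HiveContext; the table name and query are placeholders, not the real workload:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // Raise the number of shuffle partitions used by group by / join queries.
    hiveContext.setConf("spark.sql.shuffle.partitions", "1000")
    // "my_table" and "key_col" are placeholder names.
    val result = hiveContext.sql(
      "SELECT key_col, COUNT(*) AS cnt FROM my_table GROUP BY key_col")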
The next strategy is to reduce the amount of data being shuffled as a whole. A few of the things you can do are: get rid of the columns you don't need, filter out unnecessary records, and optimize data ingestion. Also de-normalize the datasets, especially if the shuffle is caused by a join.
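For example, pruning columns and filtering before the aggregation keeps the shuffle small. This is only a sketch; the table and column names ("events", "user_id", "amount", "event_date") are made up:

    import org.apache.spark.sql.functions.sum

    val events = hiveContext.table("events")
    val trimmed = events
      .select("user_id", "amount", "event_date")   // keep only the columns the query needs
      .filter("event_date >= '2015-01-01'")        // drop unnecessary records early
    // The group by now shuffles far less data per row.
    val totals = trimmed.groupBy("user_id").agg(sum("amount"))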
The general recommendation for Spark is to have 4x as many partitions as the number of cores available to the application in the cluster; as an upper bound on the partition count, each task should still take at least 100 ms to execute.
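As a worked example of that rule of thumb (the cluster numbers here are assumptions, not taken from the question):

    // Assume the application gets 25 executors with 8 cores each.
    val coresForApp = 25 * 8                  // 200 cores
    val targetPartitions = coresForApp * 4    // 4x rule of thumb => 800 partitions
    // If tasks at 800 partitions finish in well under 100 ms each,
    // that is a sign the partition count is already too high.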
If you want to increase the number of partitions of your DataFrame, all you need to call is the repartition() function. It returns a new DataFrame partitioned by the given partitioning expressions; the resulting DataFrame is hash partitioned.
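A minimal sketch of repartitioning before a group by, assuming an existing DataFrame df; in Spark 1.4, repartition(n) with a target partition count is what is available, and the table/column names below are placeholders:

    // Spread the data over more partitions before the expensive aggregation.
    val repartitioned = df.repartition(1000)
    repartitioned.registerTempTable("my_table_repartitioned")
    val counts = hiveContext.sql(
      "SELECT key_col, COUNT(*) AS cnt FROM my_table_repartitioned GROUP BY key_col")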
If you're running out of memory on the shuffle, try setting spark.sql.shuffle.partitions
to 2001.
Spark uses a different data structure for shuffle book-keeping when the number of partitions is greater than 2000:
    private[spark] object MapStatus {
      def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
        if (uncompressedSizes.length > 2000) {
          HighlyCompressedMapStatus(loc, uncompressedSizes)
        } else {
          new CompressedMapStatus(loc, uncompressedSizes)
        }
      }
      ...
I really wish they would let you configure this independently.
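In practice, the advice above just means pushing the setting past that 2000 threshold. A minimal sketch (the query itself is a placeholder):

    // 2001 > 2000, so map output sizes are tracked with HighlyCompressedMapStatus.
    hiveContext.setConf("spark.sql.shuffle.partitions", "2001")
    hiveContext.sql("SELECT key_col, COUNT(*) FROM my_table GROUP BY key_col")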
By the way, I found this information in a Cloudera slide deck.