I launched a Spark job with these settings (among others):
spark.driver.maxResultSize 11GB
spark.driver.memory 12GB
I was debugging my PySpark job, and it kept failing with the error:
serialized results of 16 tasks (17.4 GB) is bigger than spark.driver.maxResultSize (11 GB)
So I increased spark.driver.maxResultSize to 18 GB in the configuration settings, and it worked!
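For reference, this is a minimal sketch of how I set these options when building the session (the app name is just a placeholder; values mirror the ones above):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("large-result-job")  # placeholder app name
    # Heap size of the JVM driver process; must be in place before the
    # driver JVM starts (here, or via spark-submit --driver-memory).
    .config("spark.driver.memory", "12g")
    # Upper bound on the total size of serialized task results
    # collected back to the driver.
    .config("spark.driver.maxResultSize", "18g")
    .getOrCreate()
)
```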
Now, this is interesting because in both cases spark.driver.memory was SMALLER than the serialized results returned.

Why is this allowed? I would have assumed it to be impossible: the serialized results were 17.4 GB when I was debugging, which is more than the driver's memory of 12 GB, as shown above. How is this possible?
It is possible because spark.driver.memory configures the JVM driver process, not the Python interpreter. Data between the two is transferred over sockets, so the driver process does not have to keep all of the data in memory (it does not convert it into a local structure).
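To make the distinction concrete, here is a minimal PySpark sketch (the DataFrame is just a stand-in for real data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

# Stand-in for a large distributed result.
df = spark.range(0, 1_000_000)

# collect() is what spark.driver.maxResultSize limits: the serialized
# task results pass through the JVM driver and are then streamed over a
# local socket to this Python process, so the resulting list lives in
# Python memory rather than in the JVM heap sized by spark.driver.memory.
rows = df.collect()

# If the Python side should not hold everything at once either,
# toLocalIterator() fetches one partition at a time.
for row in df.toLocalIterator():
    pass  # process rows incrementally
```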