I'm trying to convert a PySpark DataFrame of size [2734984 rows x 11 columns] to a pandas DataFrame by calling toPandas(). While this works fine (about 11 seconds) in an Azure Databricks notebook, I run into a java.lang.OutOfMemoryError: Java heap space exception when I run the exact same code via databricks-connect (the databricks-connect version and the Databricks Runtime version match; both are 7.1).
I already increased the Spark driver memory (100g) and spark.driver.maxResultSize (15g). I suspect the problem lies somewhere in databricks-connect, because I cannot reproduce it in the notebooks.
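For reference, the call and the settings described above look roughly like this (a minimal sketch; the table name and the exact way the config is applied are illustrative, not my actual code):

```python
from pyspark.sql import SparkSession

# Driver-side settings as described above (values are the ones I mentioned;
# how/where they are set is illustrative).
spark = (
    SparkSession.builder
    .config("spark.driver.memory", "100g")
    .config("spark.driver.maxResultSize", "15g")
    .getOrCreate()
)

# Roughly 2.7M rows x 11 columns, then pulled to the client as pandas.
df = spark.table("some_table")  # hypothetical table name
pdf = df.toPandas()             # fails with Java heap space via databricks-connect
```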
Any hint what's going on here?
The error is the following one:
Exception in thread "serve-Arrow" java.lang.OutOfMemoryError: Java heap space
at com.ning.compress.lzf.ChunkDecoder.decode(ChunkDecoder.java:51)
at com.ning.compress.lzf.LZFDecoder.decode(LZFDecoder.java:102)
at com.databricks.service.SparkServiceRPCClient.executeRPC0(SparkServiceRPCClient.scala:84)
at com.databricks.service.SparkServiceRemoteFuncRunner.withRpcRetries(SparkServiceRemoteFuncRunner.scala:234)
at com.databricks.service.SparkServiceRemoteFuncRunner.executeRPC(SparkServiceRemoteFuncRunner.scala:156)
at com.databricks.service.SparkServiceRemoteFuncRunner.executeRPCHandleCancels(SparkServiceRemoteFuncRunner.scala:287)
at com.databricks.service.SparkServiceRemoteFuncRunner.$anonfun$execute0$1(SparkServiceRemoteFuncRunner.scala:118)
at com.databricks.service.SparkServiceRemoteFuncRunner$$Lambda$934/2145652039.apply(Unknown Source)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.service.SparkServiceRemoteFuncRunner.withRetry(SparkServiceRemoteFuncRunner.scala:135)
at com.databricks.service.SparkServiceRemoteFuncRunner.execute0(SparkServiceRemoteFuncRunner.scala:113)
at com.databricks.service.SparkServiceRemoteFuncRunner.$anonfun$execute$1(SparkServiceRemoteFuncRunner.scala:86)
at com.databricks.service.SparkServiceRemoteFuncRunner$$Lambda$1031/465320026.apply(Unknown Source)
at com.databricks.spark.util.Log4jUsageLogger.recordOperation(UsageLogger.scala:210)
at com.databricks.spark.util.UsageLogging.recordOperation(UsageLogger.scala:346)
at com.databricks.spark.util.UsageLogging.recordOperation$(UsageLogger.scala:325)
at com.databricks.service.SparkServiceRPCClientStub.recordOperation(SparkServiceRPCClientStub.scala:61)
at com.databricks.service.SparkServiceRemoteFuncRunner.execute(SparkServiceRemoteFuncRunner.scala:78)
at com.databricks.service.SparkServiceRemoteFuncRunner.execute$(SparkServiceRemoteFuncRunner.scala:67)
at com.databricks.service.SparkServiceRPCClientStub.execute(SparkServiceRPCClientStub.scala:61)
at com.databricks.service.SparkServiceRPCClientStub.executeRDD(SparkServiceRPCClientStub.scala:225)
at com.databricks.service.SparkClient$.executeRDD(SparkClient.scala:279)
at com.databricks.spark.util.SparkClientContext$.executeRDD(SparkClientContext.scala:161)
at org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:864)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:928)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2331)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2426)
at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$6(Dataset.scala:3638)
at org.apache.spark.sql.Dataset$$Lambda$3567/1086808304.apply$mcV$sp(Unknown Source)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$3(Dataset.scala:3642)
```
A java.lang.OutOfMemoryError: Java heap space is thrown when there is not enough space to allocate an object in the Java heap: the garbage collector cannot free enough memory for the new object and the heap cannot be expanded any further.
There are a few general ways to deal with a heap memory issue: increase the maximum heap available to the JVM with the -Xmx option (e.g. -Xmx512M), or reduce the amount of data that has to be held in memory at once, for example by keeping the work distributed across the cluster instead of collecting everything into one process.
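One way to apply the "hold less in memory" advice here is simply to pull fewer rows and columns to the client at once; a rough sketch, assuming df is the 2.7M-row DataFrame from the question and using placeholder column names:

```python
# Sketch only: the column names and the row limit are placeholders.
subset = (
    df.select("col_a", "col_b")   # keep only the columns you actually need
      .limit(500_000)             # or .filter(...) down to the rows of interest
)
pdf = subset.toPandas()           # much less data has to fit in the driver heap
```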
In this case, the failure is most likely because databricks-connect executes toPandas() on the client machine: the results are collected into the local driver JVM, which then runs out of heap. You can increase the local driver memory by setting spark.driver.memory in the (local) config file ${spark_home}/conf/spark-defaults.conf, where ${spark_home} can be obtained with databricks-connect get-spark-home.
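Concretely, that looks something like this (the 8g value is only an example; pick a size that fits your client machine):

```
# print the path of the local Spark installation used by databricks-connect
databricks-connect get-spark-home

# then add (or raise) this line in ${spark_home}/conf/spark-defaults.conf
spark.driver.memory 8g
```

After changing spark-defaults.conf you'll need to restart your Python session so a new local driver JVM picks up the setting.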