
Running into 'java.lang.OutOfMemoryError: Java heap space' when using toPandas() and databricks connect

I'm trying to convert a PySpark DataFrame of size [2734984 rows x 11 columns] to a pandas DataFrame by calling toPandas(). This works fine (11 seconds) in an Azure Databricks notebook, but I get a java.lang.OutOfMemoryError: Java heap space exception when I run the exact same code using databricks-connect (the databricks-connect version and the Databricks Runtime version match; both are 7.1).

I already increased the spark driver memory (100g) and the maxResultSize (15g). I suppose that the error lies somewhere in databricks-connect because I cannot replicate it using the Notebooks.

Any hint what's going on here?

The error is the following one:

Exception in thread "serve-Arrow" java.lang.OutOfMemoryError: Java heap space
    at com.ning.compress.lzf.ChunkDecoder.decode(ChunkDecoder.java:51)
    at com.ning.compress.lzf.LZFDecoder.decode(LZFDecoder.java:102)
    at com.databricks.service.SparkServiceRPCClient.executeRPC0(SparkServiceRPCClient.scala:84)
    at com.databricks.service.SparkServiceRemoteFuncRunner.withRpcRetries(SparkServiceRemoteFuncRunner.scala:234)
    at com.databricks.service.SparkServiceRemoteFuncRunner.executeRPC(SparkServiceRemoteFuncRunner.scala:156)
    at com.databricks.service.SparkServiceRemoteFuncRunner.executeRPCHandleCancels(SparkServiceRemoteFuncRunner.scala:287)
    at com.databricks.service.SparkServiceRemoteFuncRunner.$anonfun$execute0$1(SparkServiceRemoteFuncRunner.scala:118)
    at com.databricks.service.SparkServiceRemoteFuncRunner$$Lambda$934/2145652039.apply(Unknown Source)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.service.SparkServiceRemoteFuncRunner.withRetry(SparkServiceRemoteFuncRunner.scala:135)
    at com.databricks.service.SparkServiceRemoteFuncRunner.execute0(SparkServiceRemoteFuncRunner.scala:113)
    at com.databricks.service.SparkServiceRemoteFuncRunner.$anonfun$execute$1(SparkServiceRemoteFuncRunner.scala:86)
    at com.databricks.service.SparkServiceRemoteFuncRunner$$Lambda$1031/465320026.apply(Unknown Source)
    at com.databricks.spark.util.Log4jUsageLogger.recordOperation(UsageLogger.scala:210)
    at com.databricks.spark.util.UsageLogging.recordOperation(UsageLogger.scala:346)
    at com.databricks.spark.util.UsageLogging.recordOperation$(UsageLogger.scala:325)
    at com.databricks.service.SparkServiceRPCClientStub.recordOperation(SparkServiceRPCClientStub.scala:61)
    at com.databricks.service.SparkServiceRemoteFuncRunner.execute(SparkServiceRemoteFuncRunner.scala:78)
    at com.databricks.service.SparkServiceRemoteFuncRunner.execute$(SparkServiceRemoteFuncRunner.scala:67)
    at com.databricks.service.SparkServiceRPCClientStub.execute(SparkServiceRPCClientStub.scala:61)
    at com.databricks.service.SparkServiceRPCClientStub.executeRDD(SparkServiceRPCClientStub.scala:225)
    at com.databricks.service.SparkClient$.executeRDD(SparkClient.scala:279)
    at com.databricks.spark.util.SparkClientContext$.executeRDD(SparkClientContext.scala:161)
    at org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:864)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:928)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2331)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2426)
    at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$6(Dataset.scala:3638)
    at org.apache.spark.sql.Dataset$$Lambda$3567/1086808304.apply$mcV$sp(Unknown Source)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
    at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$3(Dataset.scala:3642)
asked Dec 09 '20 17:12 by petzholt



1 Answer

This is likely because databricks-connect executes toPandas() on the client machine, which can then run out of memory. You could increase the local driver memory by setting spark.driver.memory in the (local) config file ${spark_home}/conf/spark-defaults.conf, where ${spark_home} can be obtained with databricks-connect get-spark-home.
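
As a rough sketch of that suggestion, the snippet below looks up the local Spark home through the databricks-connect CLI and writes spark.driver.memory into conf/spark-defaults.conf. The helper names (get_spark_home, set_driver_memory) and the "8g" value are illustrative assumptions, not part of databricks-connect; pick a value that fits the client machine. The setting should only take effect for a newly started local JVM, so the Python session probably needs to be restarted afterwards.

```python
import re
import subprocess
from pathlib import Path

# Matches an existing spark.driver.memory entry ("key value", "key=value" or "key:value").
_DRIVER_MEMORY_LINE = re.compile(r"\s*spark\.driver\.memory[\s=:]")


def get_spark_home() -> Path:
    """Ask the databricks-connect CLI where its local Spark installation lives."""
    result = subprocess.run(
        ["databricks-connect", "get-spark-home"],
        check=True,
        capture_output=True,
        text=True,
    )
    return Path(result.stdout.strip())


def set_driver_memory(spark_home: Path, memory: str = "8g") -> Path:
    """Set spark.driver.memory in <spark_home>/conf/spark-defaults.conf.

    Hypothetical helper: drops any earlier spark.driver.memory line and
    appends the new value, leaving all other settings untouched.
    """
    conf_file = Path(spark_home) / "conf" / "spark-defaults.conf"
    conf_file.parent.mkdir(parents=True, exist_ok=True)
    lines = conf_file.read_text().splitlines() if conf_file.exists() else []
    kept = [line for line in lines if not _DRIVER_MEMORY_LINE.match(line)]
    kept.append(f"spark.driver.memory {memory}")
    conf_file.write_text("\n".join(kept) + "\n")
    return conf_file


# Example (requires databricks-connect to be installed and on PATH):
#   conf_path = set_driver_memory(get_spark_home(), "8g")
#   print(f"Updated {conf_path}")
```

Editing the file by hand works just as well; the script only makes the change repeatable.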

answered Nov 15 '22 08:11 by sander-db