I'm trying to convert a PySpark DataFrame of size [2734984 rows x 11 columns] to a pandas DataFrame by calling toPandas(). While this works fine (about 11 seconds) in an Azure Databricks notebook, I run into a java.lang.OutOfMemoryError: Java heap space exception when I run the exact same code via databricks-connect (the databricks-connect version and the Databricks Runtime version match; both are 7.1).
I already increased the Spark driver memory (100g) and spark.driver.maxResultSize (15g). I suspect the problem lies somewhere in databricks-connect, because I cannot reproduce it in the notebooks.
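For reference, the call and the settings described above look roughly like this (a minimal sketch; the table name and the exact way the config is applied are illustrative, not my actual code):

```python
from pyspark.sql import SparkSession

# Driver-side settings as described above (values are the ones I mentioned;
# how/where they are set is illustrative).
spark = (
    SparkSession.builder
    .config("spark.driver.memory", "100g")
    .config("spark.driver.maxResultSize", "15g")
    .getOrCreate()
)

# Roughly 2.7M rows x 11 columns, then pulled to the client as pandas.
df = spark.table("some_table")  # hypothetical table name
pdf = df.toPandas()             # fails with Java heap space via databricks-connect
```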
Any hint what's going on here?
The error is the following one:
Exception in thread "serve-Arrow" java.lang.OutOfMemoryError: Java heap space
at com.ning.compress.lzf.ChunkDecoder.decode(ChunkDecoder.java:51)
at com.ning.compress.lzf.LZFDecoder.decode(LZFDecoder.java:102)
at com.databricks.service.SparkServiceRPCClient.executeRPC0(SparkServiceRPCClient.scala:84)
at com.databricks.service.SparkServiceRemoteFuncRunner.withRpcRetries(SparkServiceRemoteFuncRunner.scala:234)
at com.databricks.service.SparkServiceRemoteFuncRunner.executeRPC(SparkServiceRemoteFuncRunner.scala:156)
at com.databricks.service.SparkServiceRemoteFuncRunner.executeRPCHandleCancels(SparkServiceRemoteFuncRunner.scala:287)
at com.databricks.service.SparkServiceRemoteFuncRunner.$anonfun$execute0$1(SparkServiceRemoteFuncRunner.scala:118)
at com.databricks.service.SparkServiceRemoteFuncRunner$$Lambda$934/2145652039.apply(Unknown Source)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.service.SparkServiceRemoteFuncRunner.withRetry(SparkServiceRemoteFuncRunner.scala:135)
at com.databricks.service.SparkServiceRemoteFuncRunner.execute0(SparkServiceRemoteFuncRunner.scala:113)
at com.databricks.service.SparkServiceRemoteFuncRunner.$anonfun$execute$1(SparkServiceRemoteFuncRunner.scala:86)
at com.databricks.service.SparkServiceRemoteFuncRunner$$Lambda$1031/465320026.apply(Unknown Source)
at com.databricks.spark.util.Log4jUsageLogger.recordOperation(UsageLogger.scala:210)
at com.databricks.spark.util.UsageLogging.recordOperation(UsageLogger.scala:346)
at com.databricks.spark.util.UsageLogging.recordOperation$(UsageLogger.scala:325)
at com.databricks.service.SparkServiceRPCClientStub.recordOperation(SparkServiceRPCClientStub.scala:61)
at com.databricks.service.SparkServiceRemoteFuncRunner.execute(SparkServiceRemoteFuncRunner.scala:78)
at com.databricks.service.SparkServiceRemoteFuncRunner.execute$(SparkServiceRemoteFuncRunner.scala:67)
at com.databricks.service.SparkServiceRPCClientStub.execute(SparkServiceRPCClientStub.scala:61)
at com.databricks.service.SparkServiceRPCClientStub.executeRDD(SparkServiceRPCClientStub.scala:225)
at com.databricks.service.SparkClient$.executeRDD(SparkClient.scala:279)
at com.databricks.spark.util.SparkClientContext$.executeRDD(SparkClientContext.scala:161)
at org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:864)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:928)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2331)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2426)
at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$6(Dataset.scala:3638)
at org.apache.spark.sql.Dataset$$Lambda$3567/1086808304.apply$mcV$sp(Unknown Source)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$3(Dataset.scala:3642)
```
A java.lang.OutOfMemoryError: Java heap space is thrown when there is not enough space to allocate an object in the Java heap: the garbage collector cannot free enough memory for the new object and the heap cannot be expanded any further.
There are a few general ways to deal with a heap memory issue: increase the maximum heap available to the JVM with the -Xmx option (e.g. -Xmx512M), or reduce the amount of data that has to be held in memory at once, for example by keeping the work distributed across the cluster instead of collecting everything into one process.
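One way to apply the "hold less in memory" advice here is simply to pull fewer rows and columns to the client at once; a rough sketch, assuming df is the 2.7M-row DataFrame from the question and using placeholder column names:

```python
# Sketch only: the column names and the row limit are placeholders.
subset = (
    df.select("col_a", "col_b")   # keep only the columns you actually need
      .limit(500_000)             # or .filter(...) down to the rows of interest
)
pdf = subset.toPandas()           # much less data has to fit in the driver heap
```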
In this case, the failure is most likely because databricks-connect executes toPandas() on the client machine: the results are collected into the local driver JVM, which then runs out of heap. You can increase the local driver memory by setting spark.driver.memory in the (local) config file ${spark_home}/conf/spark-defaults.conf, where ${spark_home} can be obtained with databricks-connect get-spark-home.
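Concretely, that looks something like this (the 8g value is only an example; pick a size that fits your client machine):

```
# print the path of the local Spark installation used by databricks-connect
databricks-connect get-spark-home

# then add (or raise) this line in ${spark_home}/conf/spark-defaults.conf
spark.driver.memory 8g
```

After changing spark-defaults.conf you'll need to restart your Python session so a new local driver JVM picks up the setting.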