I'm running code in Apache Spark on Azure that converts over 3 million XML files into one CSV file. I get the following error when I try to do this:
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1408098 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)
I know what the error means in general, but I don't understand what it is telling me in my case or how to solve it.
The code is:
df = spark.read.format('com.databricks.spark.xml').option("rowTag", "ns0:TicketScan").load('LOCATION/*.xml')
def saveDfToCsv(df, tsvOutput):
    tmpParquetDir = "dbfs:/tmp/mart1.tmp.csv"
    dbutils.fs.rm(tmpParquetDir, True)
    df.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save(tmpParquetDir)
    # list() is needed on Python 3, where filter() returns an iterator
    src = list(filter(lambda x: "part-00000" in x.name, dbutils.fs.ls(tmpParquetDir)))[0].path
    dbutils.fs.mv(src, tsvOutput)
saveDfToCsv(df, 'LOCATION/database.csv')
I hope my question is clear enough; if not, please let me know and I will explain it further.
I hope someone can help me.
Best regards.
Go into the cluster settings, under Advanced select Spark, and paste spark.driver.maxResultSize 0 (for unlimited) or whatever value suits you.
spark.driver.maxResultSize sets a limit on the total size of serialized results of all partitions for each Spark action (such as collect). Jobs will fail if the size of the results exceeds this limit; however, a high limit can cause out-of-memory errors in the driver.
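On Databricks this property has to go into the cluster's Spark config as described above, because it is read when the driver starts. If you control session startup yourself (outside Databricks, or via spark-submit), it can be set while building the session. A minimal sketch, assuming a self-managed PySpark session and an illustrative value of 8g:

from pyspark.sql import SparkSession

# Illustrative only: "8g" is an assumed value, not a recommendation for this job.
# The property is read when the driver starts, so it cannot be changed on an
# already-running cluster or session.
spark = (SparkSession.builder
         .appName("xml-to-csv")
         .config("spark.driver.maxResultSize", "8g")
         .getOrCreate())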
The executors are responsible for actually executing the work that the driver assigns them. This means each executor is responsible for only two things: executing the code assigned to it by the driver and reporting the state of that computation back to the driver node.
You need to change this parameter in the cluster configuration as described above. Note that using 0 (unlimited) is not recommended; you should also try to optimize the job by repartitioning.
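As a hedged illustration of the repartitioning suggestion (the partition count of 200 and the output path dbfs:/tmp/mart1_parts are assumptions, not values from the question): writing the CSV with several partitions instead of forcing a single one keeps each task's serialized result small, and the part files can be concatenated afterwards if a single file is really required.

# Sketch only: 200 partitions and the output path are illustrative assumptions.
df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "ns0:TicketScan")
      .load("LOCATION/*.xml"))

(df.repartition(200)
   .write.format("csv")              # Spark's built-in CSV source
   .option("header", "true")
   .mode("overwrite")
   .save("dbfs:/tmp/mart1_parts"))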