
Databricks Exception: Total size of serialized results is bigger than spark.driver.maxResultsSize

I'm running code in Apache Spark on Azure Databricks that converts over 3 million XML files into one CSV file. I get the following error when I try to do this:

org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1408098 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)

I know what the error means in general, but I don't know what it means in my case and I don't understand how to solve this.

The code is:

All XML files are loaded:

df = spark.read.format('com.databricks.spark.xml').option("rowTag", "ns0:TicketScan").load('LOCATION/*.xml')
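A quick way to see the scale involved: each input file typically becomes at least one partition, and every resulting task sends a serialized result back to the driver. The check below uses the standard PySpark API and is not part of the original post:

# Number of partitions (and therefore tasks) produced by the read.
# With ~3 million small XML files this can run into the millions,
# which is what pushes the serialized-results total past the limit.
print(df.rdd.getNumPartitions())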

All loaded files are then written out as a single CSV file:

def saveDfToCsv(df, csvOutput):
    # Write to a temporary directory first, then move the single part file to the target path
    tmpCsvDir = "dbfs:/tmp/mart1.tmp.csv"
    dbutils.fs.rm(tmpCsvDir, True)
    df.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save(tmpCsvDir)
    # filter() returns an iterator in Python 3, so materialize it before indexing
    src = list(filter(lambda x: "part-00000" in x.name, dbutils.fs.ls(tmpCsvDir)))[0].path
    dbutils.fs.mv(src, csvOutput)

saveDfToCsv(df, 'LOCATION/database.csv')

I hope my question is clear enough. If not, please let me know and I'll explain further.

I hope someone can help me.

Best regards.

asked Oct 30 '18 by Ganesh Gebhard


People also ask

How do I change the spark driver maxResultSize in Databricks?

Go into the cluster settings; under Advanced Options, select Spark and paste spark.driver.maxResultSize 0 (for unlimited) or whatever value suits you.

What is spark driver maxResultSize?

spark.driver.maxResultSize sets a limit on the total size of serialized results of all partitions for each Spark action (such as collect). Jobs will fail if the size of the results exceeds this limit; however, a high limit can cause out-of-memory errors in the driver.
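If you build the SparkSession yourself rather than relying on a preconfigured Databricks cluster, the limit can also be set at session construction. A minimal sketch using the standard builder API; the app name and the 8g value are illustrative, not recommendations:

from pyspark.sql import SparkSession

# Raise the cap on serialized results returned to the driver (illustrative value)
spark = (SparkSession.builder
         .appName("xml-to-csv")
         .config("spark.driver.maxResultSize", "8g")
         .getOrCreate())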

What is the difference between driver and executor in spark?

The executors are responsible for actually executing the work that the driver assigns them. Each executor is responsible for only two things: executing code assigned to it by the driver and reporting the state of the computation on that executor back to the driver node. The driver, by contrast, maintains information about the application, responds to the user's program, and schedules the work across the executors.


1 Answer

You need to change this parameter in the cluster configuration. Go into the cluster settings; under Advanced Options, select Spark and paste spark.driver.maxResultSize 0 (for unlimited) or whatever value suits you. Using 0 is not recommended, though: you should optimize the job by repartitioning instead.
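To illustrate the repartitioning route, a hedged sketch (the partition count is an assumed, illustrative value, not from the answer): consolidating the roughly 1.4 million input partitions into far fewer before doing any further work reduces the number of task results the driver has to track.

# Consolidate the millions of tiny input partitions into a manageable number;
# fewer tasks means less serialized result metadata accumulating on the driver.
df = df.repartition(200)  # illustrative count; tune to cluster size and data volume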

answered Nov 03 '22 by Salman Ghauri