I'm running code in Apache Spark on Azure that converts over 3 million XML files into one CSV file. I get the following error when I try to do this:
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1408098 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)
I know what the error means in general, but I don't understand what it is telling me in my case or how to solve it.
The code is:
df = spark.read.format('com.databricks.spark.xml').option("rowTag", "ns0:TicketScan").load('LOCATION/*.xml')
def saveDfToCsv(df, tsvOutput):
    tmpParquetDir = "dbfs:/tmp/mart1.tmp.csv"
    dbutils.fs.rm(tmpParquetDir, True)
    df.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save(tmpParquetDir)
    # list() is needed on Python 3, where filter() returns an iterator
    src = list(filter(lambda x: "part-00000" in x.name, dbutils.fs.ls(tmpParquetDir)))[0].path
    dbutils.fs.mv(src, tsvOutput)
saveDfToCsv(df, 'LOCATION/database.csv')
I hope my question is clear enough; if not, please let me know and I will explain it further.
I hope someone can help me.
Best regards.
Go into the cluster settings, under Advanced select Spark, and paste spark.driver.maxResultSize 0 (for unlimited) or whatever value suits you.
spark.driver.maxResultSize sets a limit on the total size of serialized results of all partitions for each Spark action (such as collect). Jobs will fail if the size of the results exceeds this limit; however, a high limit can cause out-of-memory errors in the driver.
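On Databricks this property has to go into the cluster's Spark config as described above, because it is read when the driver starts. If you control session startup yourself (outside Databricks, or via spark-submit), it can be set while building the session. A minimal sketch, assuming a self-managed PySpark session and an illustrative value of 8g:

from pyspark.sql import SparkSession

# Illustrative only: "8g" is an assumed value, not a recommendation for this job.
# The property is read when the driver starts, so it cannot be changed on an
# already-running cluster or session.
spark = (SparkSession.builder
         .appName("xml-to-csv")
         .config("spark.driver.maxResultSize", "8g")
         .getOrCreate())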
The executors are responsible for actually executing the work that the driver assigns them. This means each executor is responsible for only two things: executing the code assigned to it by the driver and reporting the state of that computation back to the driver node.
You need to change this parameter in the cluster configuration as described above. Note that using 0 (unlimited) is not recommended; you should also try to optimize the job by repartitioning.
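As a hedged illustration of the repartitioning suggestion (the partition count of 200 and the output path dbfs:/tmp/mart1_parts are assumptions, not values from the question): writing the CSV with several partitions instead of forcing a single one keeps each task's serialized result small, and the part files can be concatenated afterwards if a single file is really required.

# Sketch only: 200 partitions and the output path are illustrative assumptions.
df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "ns0:TicketScan")
      .load("LOCATION/*.xml"))

(df.repartition(200)
   .write.format("csv")              # Spark's built-in CSV source
   .option("header", "true")
   .mode("overwrite")
   .save("dbfs:/tmp/mart1_parts"))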