I am running a Spark job on a cluster that has 2 worker nodes. I am using the code below (Spark Java) to save the computed DataFrame as CSV to the worker nodes.
dataframe.write().option("header","false").mode(SaveMode.Overwrite).csv(outputDirPath);
I am trying to understand how Spark writes multiple part files on each worker node.
Run 1) worker1 has part files and _SUCCESS; worker2 has _temporary/task*/part* (each task directory contains its own part files).
Run 2) worker1 has part files and also a _temporary directory; worker2 has multiple part files.
Can anyone help me understand this behavior?
1) Should I consider the records in outputDir/_temporary as part of the output, along with the part files in outputDir?
2) Is the _temporary dir supposed to be deleted after the job run, with its part files moved to outputDir?
3) Why can't it create the part files directly under the output dir?
coalesce(1) and repartition(1) are not options, since the output in outputDir will be around 500 GB.
Spark 2.0.2 / 2.1.3 and Java 8, no HDFS.
In Spark, you can save (write) a DataFrame to CSV files on disk by using dataframe.write().csv("path"); with the same API you can also write the DataFrame to AWS S3, Azure Blob, HDFS, or any other Spark-supported file system.
TL;DR To properly write (or read, for that matter) data using a file-system-based source, you'll need shared storage.
The _temporary directory is part of the basic commit mechanism used by Spark: data is first written to a temporary directory and, once all tasks have finished, atomically moved to the final destination. You can read more about this process in Spark _temporary creation reason.
For this process to be successful you need a shared file system (HDFS, NFS, and so on) or equivalent distributed storage (like S3). Since you don't have one, the failure to clean up the temporary state is expected; see Saving dataframe to local file system results in empty results.
The behavior you observed (data partially committed and partially not) can occur when some executors are co-located with the driver and share a file system with it, enabling a full commit for that subset of the data.
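As a minimal sketch of what writing against shared storage looks like (the HDFS namenode host/port, the source path argument, and the class name below are placeholders, not from the answer):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class SharedStorageCsvWrite {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-write-shared-storage")
                .getOrCreate();

        // Source data; args[0] is assumed to point at an existing Parquet dataset.
        Dataset<Row> dataframe = spark.read().parquet(args[0]);

        // Writing to a path that the driver and every executor can reach
        // (HDFS here; an s3a:// or NFS-mounted path works the same way)
        // lets the commit protocol move _temporary output to the final directory.
        dataframe.write()
                .option("header", "false")
                .mode(SaveMode.Overwrite)
                .csv("hdfs://namenode:8020/output/csv"); // placeholder shared path

        spark.stop();
    }
}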
After analysis, I observed that my Spark job was using fileoutputcommitter version 1, which is the default. I then added the config to use fileoutputcommitter version 2 instead of version 1 and tested on a 10-node Spark standalone cluster in AWS. All part-* files were generated directly under the outputDirPath specified in dataframe.write().option("header","false").mode(SaveMode.Overwrite).csv(outputDirPath).
We can set the property either by passing --conf 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2' to the spark-submit command, or by setting it on the SparkContext's Hadoop configuration: javaSparkContext.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2"). A sketch of both approaches follows.
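A minimal sketch of both approaches, assuming a Spark Java job; the application name, jar name, and class names are placeholders, not from the original post:

// 1) On the command line:
// spark-submit \
//   --conf 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2' \
//   --class com.example.CsvWriteJob csv-write-job.jar
//
// 2) Programmatically, before any write is performed:
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class CommitterV2Config {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("committer-v2-demo")
                // Equivalent to the --conf flag above.
                .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
                .getOrCreate();

        // Or set it directly on the underlying Hadoop configuration.
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2");

        // ... build the DataFrame and call write() as in the question ...
        spark.stop();
    }
}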
I understand the consequences in case of failures, as outlined in the Spark docs, but I achieved the desired result!
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version (default value: 1)
The file output committer algorithm version; valid values are 1 or 2. Version 2 may have better performance, but version 1 may handle failures better in certain situations, as per MAPREDUCE-4815.
Multiple part files are based on your DataFrame's partitioning. The number of files written depends on the number of partitions the DataFrame has at the time you write out the data; by default, one file is written per partition.
You can control this with coalesce or repartition: you can reduce the number of partitions or increase it.
If you coalesce to 1, you won't see multiple part files, but this affects writing the data in parallel.
[outputDirPath = /tmp/multiple.csv]
dataframe
    .coalesce(1)
    .write()
    .option("header", "false")
    .mode(SaveMode.Overwrite)
    .csv(outputDirPath);
On your question of how to refer to it: refer to /tmp/multiple.csv for all of the parts below.
/tmp/multiple.csv/part-00000.csv
/tmp/multiple.csv/part-00001.csv
/tmp/multiple.csv/part-00002.csv
/tmp/multiple.csv/part-00003.csv
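If a single file is not an option (as with the ~500 GB output in the question), the opposite direction is to repartition so the write stays parallel. A minimal sketch, where the partition count of 200 and the class name are just illustrative values:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class RepartitionedCsvWrite {
    // One part file is written per partition, so roughly 200 part files here,
    // and the write runs in parallel across the executors.
    static void writeCsv(Dataset<Row> dataframe, String outputDirPath) {
        dataframe
                .repartition(200)
                .write()
                .option("header", "false")
                .mode(SaveMode.Overwrite)
                .csv(outputDirPath);
    }
}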