 

Spark 2.2.0 FileOutputCommitter

DirectFileOutputCommitter is no longer available in Spark 2.2.0. This means writing to S3 takes an insanely long time (3 hours vs. 2 minutes). I'm able to work around this in spark-shell by setting the FileOutputCommitter algorithm version to 2, like this:

spark-shell --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 

The same does not work with spark-sql:

spark-sql --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 

The above command seems to set version=2, but when the query is executed it still shows version 1 behaviour.
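For reference, in spark-shell the effective value can be read back from the underlying Hadoop configuration (a quick check only; spark-sql has no equivalent Scala prompt, so this only verifies the shell case):

spark.sparkContext.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version")
// returns "2" if the setting was picked up; null (or "1") means the default version 1 committer is still in effect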

Two questions:

1) How do I get FileOutputCommitter version 2 behaviour with spark-sql?

2) Is there a way I can still use DirectFileOutputCommitter in Spark 2.2.0? [I'm fine with a non-zero chance of missing data]

Related items:

Spark 1.6 DirectFileOutputCommitter

asked Sep 17 '17 by user3279189



1 Answer

I have been hit by this issue too. Spark discourages the use of DirectFileOutputCommitter because it can lead to data loss under race conditions, and algorithm version 2 doesn't help much either.

I tried saving the data to S3 with gzip instead of Snappy compression, which gave some benefit.
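For reference, a minimal sketch of switching the Parquet output codec to gzip via the writer option (df and the bucket path below are placeholders, not from the question):

// write Parquet with gzip instead of the default snappy codec
df.write
  .option("compression", "gzip")
  .parquet("s3://my-bucket/output/")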

The real issue is that Spark writes to s3://<output_directory>/_temporary/0 first and then copies the data from the temporary location to the output path. This copy step is quite slow on S3 (generally around 6 MB/s), so with a lot of data you get a considerable slowdown.

The alternative is to write to HDFS first, then use distcp / s3-dist-cp to copy the data to S3.
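A minimal sketch of that two-step approach, assuming an EMR cluster where s3-dist-cp is available (paths are placeholders):

// 1) write the result to HDFS first, where the commit/rename step is cheap
df.write.parquet("hdfs:///staging/output")

// 2) then copy the finished files to S3 from the shell, e.g.
//    s3-dist-cp --src hdfs:///staging/output --dest s3://my-bucket/output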

You could also look at the solution Netflix provided; I haven't evaluated it.

EDIT:

The new Spark 2.4 release has solved the problem of slow S3 writes. I have found that the S3 write performance of Spark 2.4 with Hadoop 2.8 on the latest EMR release (5.24) is almost on par with HDFS writes.

See these documents:

  1. https://aws.amazon.com/blogs/big-data/improve-apache-spark-write-performance-on-apache-parquet-formats-with-the-emrfs-s3-optimized-committer/

  2. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-performance.html
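On recent EMR releases the optimized committer from the first link is toggled with a Spark SQL property; if I remember the name correctly it is the one below, but verify it against the linked docs for your EMR version:

spark-shell --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true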

answered Oct 17 '22 by Avishek Bhattacharya