 

Spark 2.2.0 FileOutputCommitter

DirectFileOutputCommitter is no longer available in Spark 2.2.0. This means writing to S3 takes an insanely long time (3 hours vs. 2 minutes). I'm able to work around this in spark-shell by setting the FileOutputCommitter algorithm version to 2, like this:

spark-shell --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 

The same does not work with spark-sql:

spark-sql --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 

The above command seems to set version=2, but when the query is executed it still shows version 1 behaviour.
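For reference, in spark-shell the effective value can be read back from the underlying Hadoop configuration (a quick check only; spark-sql has no equivalent Scala prompt, so this only verifies the shell case):

spark.sparkContext.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version")
// returns "2" if the setting was picked up; null (or "1") means the default version 1 committer is still in effect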

Two questions:

1) How do I get FileOutputCommitter version 2 behaviour with spark-sql?

2) Is there a way I can still use DirectFileOutputCommitter in Spark 2.2.0? [I'm fine with a non-zero chance of missing data]

Related items:

Spark 1.6 DirectFileOutputCommitter

asked Sep 17 '17 by user3279189



1 Answer

I have been hit by this issue too. Spark discourages the use of DirectFileOutputCommitter because it can lead to data loss under race conditions, and algorithm version 2 doesn't help much either.

I tried saving the data to S3 with gzip instead of Snappy compression, which gave some benefit.
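For reference, a minimal sketch of switching the Parquet output codec to gzip via the writer option (df and the bucket path below are placeholders, not from the question):

// write Parquet with gzip instead of the default snappy codec
df.write
  .option("compression", "gzip")
  .parquet("s3://my-bucket/output/")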

The real issue is that Spark writes to s3://<output_directory>/_temporary/0 first and then copies the data from the temporary location to the output path. This copy step is quite slow on S3 (generally around 6 MB/s), so with a lot of data you get a considerable slowdown.

The alternative is to write to HDFS first, then use distcp / s3-dist-cp to copy the data to S3.
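A minimal sketch of that two-step approach, assuming an EMR cluster where s3-dist-cp is available (paths are placeholders):

// 1) write the result to HDFS first, where the commit/rename step is cheap
df.write.parquet("hdfs:///staging/output")

// 2) then copy the finished files to S3 from the shell, e.g.
//    s3-dist-cp --src hdfs:///staging/output --dest s3://my-bucket/output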

You could also look at the solution Netflix provided; I haven't evaluated it.

EDIT:

The new Spark 2.4 release has solved the problem of slow S3 writes. I have found that the S3 write performance of Spark 2.4 with Hadoop 2.8 on the latest EMR release (5.24) is almost on par with HDFS writes.

See these documents:

  1. https://aws.amazon.com/blogs/big-data/improve-apache-spark-write-performance-on-apache-parquet-formats-with-the-emrfs-s3-optimized-committer/

  2. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-performance.html
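On recent EMR releases the optimized committer from the first link is toggled with a Spark SQL property; if I remember the name correctly it is the one below, but verify it against the linked docs for your EMR version:

spark-shell --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true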

answered Oct 17 '22 by Avishek Bhattacharya