
Spark 2.0 deprecates 'DirectParquetOutputCommitter', how to live without it?

We recently migrated from "EMR on HDFS" to "EMR on S3" (EMRFS with consistent view enabled) and found that Spark 'saveAsTable' writes (Parquet format) to S3 were ~4x slower than writes to HDFS. With Spark 1.6 we worked around this by using the DirectParquetOutputCommitter [1].

Reason for the S3 slowness: we had to pay the so-called Parquet tax [2]. The default output committer writes to a temporary directory and renames it afterwards, and a rename in S3 is very expensive because it is carried out as a copy followed by a delete.

We also understand the risk of using 'DirectParquetOutputCommitter': possible data corruption when speculative tasks are enabled.

Now, with Spark 2.0, this class has been deprecated, and we're wondering what options we have so that we don't have to bear the ~4x slower writes when we upgrade. Any thoughts, suggestions, or recommendations would be highly appreciated.

One workaround we can think of is to save to HDFS and then copy the data to S3 via s3DistCp. (Any thoughts on how this can be done in a sane way, given that our Hive metastore points to S3?)

It also looks like Netflix has fixed this [3]. Any idea when they're planning to open source it?

Thanks.

[1] - https://github.com/apache/spark/blob/21d5ca128bf3afd5c2d4c7fcc56240e28443474f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/DirectParquetOutputCommitter.scala

[2] - https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/

[3] - https://www.youtube.com/watch?v=85sew9OFaYc&feature=youtu.be&t=8m39s http://www.slideshare.net/AmazonWebServices/bdt303-running-spark-and-presto-on-the-netflix-big-data-platform

asked Sep 22 '16 04:09 by anivohra


2 Answers

You can use: sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

Since you are on EMR, just use s3 (no need for s3a).

We are using Spark 2.0 and writing Parquet to S3 pretty fast (about as fast as to HDFS).

If you want to read more, check out the JIRA ticket SPARK-10063.
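
For context, here is a minimal sketch of where that setting could go in a Spark 2.0 job. The app name, bucket, and output path are hypothetical placeholders, and it assumes a Hadoop version that supports the v2 commit algorithm (as far as I know, 2.7 or later). The `spark.hadoop.` prefix is the standard way to pass a Hadoop property through Spark's own configuration, so either of the two forms shown should have the same effect.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object CommitterV2Example {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("committer-v2-example") // hypothetical app name
      // Properties prefixed with spark.hadoop. are copied into the Hadoop configuration.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()

    // Equivalent to the call in the answer, applied to an already running context.
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

    // Small demo DataFrame; replace with the real data.
    val df = spark.range(0, 1000).toDF("id")

    // Hypothetical bucket and path; on EMR the s3:// scheme goes through EMRFS.
    df.write
      .mode(SaveMode.Overwrite)
      .parquet("s3://example-bucket/tmp/committer_v2_demo")

    spark.stop()
  }
}
```

One caveat worth keeping in mind: with algorithm version 2, task output is moved into the final location at task commit rather than at job commit, so a job that fails midway can leave partial output behind.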

answered Nov 20 '22 05:11 by Tal Joffe


I think the S3 committer from Netflix has already been open sourced: https://github.com/rdblue/s3committer

answered Nov 20 '22 03:11 by viirya