I've seen a number of questions describing problems when working with S3 in Spark:
many specifically describing issues with Parquet files:
as well as some external sources referring to other issues with the Spark/S3/Parquet combination. This makes me think that either S3 with Spark, or the complete combination, may not be the best choice.

Am I onto something here? Can anyone provide an authoritative answer explaining:
A lot of the issues aren't Parquet-specific; they stem from the fact that S3 is not a filesystem, despite the APIs trying to make it look like one. Many nominally low-cost operations take multiple HTTPS requests, with the consequent delays.
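A minimal sketch of what that means in practice, using the Hadoop FileSystem API directly. The object name, bucket, and paths below are placeholders of my own; the point is that calls which are a single cheap RPC on HDFS expand into one or more HTTPS round trips on s3a://:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object S3aMetadataTiming {
  // Time a block and print the elapsed wall-clock milliseconds.
  def timed[T](label: String)(block: => T): T = {
    val start = System.nanoTime()
    val result = block
    println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    // "my-bucket" is a placeholder; point this at a bucket you own.
    val fs  = FileSystem.get(new URI("s3a://my-bucket/"), new Configuration())
    val dir = new Path("s3a://my-bucket/data/")

    timed("exists")        { fs.exists(dir) }        // HEAD (plus LIST probes for "directories")
    timed("getFileStatus") { fs.getFileStatus(dir) } // similar round trips
    timed("listStatus")    { fs.listStatus(dir) }    // paged LIST requests
  }
}
```

Run against both an HDFS path and an s3a:// path, the difference in latencies makes the "S3 is not a filesystem" point directly visible.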
Regarding the JIRAs:
rename() to commit work is a killer. It's used at the end of tasks and jobs, and in checkpointing. The more output you generate, the longer things take to complete. The s3guard work will include a zero-rename committer, but it will take care and time to move things over to it (a partial mitigation available today is sketched after the Parquet example below).

Parquet? Pushdown works, but there are a few other options to speed things up. I list them and others in: http://www.slideshare.net/steve_l/apache-spark-and-object-stores
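A small sketch of the pushdown just mentioned: with Parquet, Spark pushes the column selection and the filter down into the reader, so only the matching row groups and the selected column are fetched from S3 rather than whole files. The path and object name are placeholders of my own:

```scala
import org.apache.spark.sql.SparkSession

object ParquetPushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pushdown-sketch").getOrCreate()

    val df = spark.read.parquet("s3a://my-bucket/output/ids")
      .select("id")           // column pruning: only this column is read
      .filter("id > 900000")  // predicate pushdown: row groups skipped via Parquet statistics

    // The physical plan should show the filter under PushedFilters,
    // confirming it reached the Parquet reader.
    df.explain(true)
    println(df.count())

    spark.stop()
  }
}
```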
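Going back to the rename() cost: until the zero-rename committer lands, one partial mitigation is the "v2" commit algorithm, a real Hadoop setting that moves task output straight to the destination and so removes one full round of renames at job commit, at the cost of weaker failure semantics. A hedged configuration sketch, with placeholder bucket and paths; check the behaviour on your Hadoop version before relying on it:

```scala
import org.apache.spark.sql.SparkSession

object S3aWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3a-write-sketch")
      // v2: task commit promotes files directly into the destination
      // directory, so job commit no longer renames everything a second time.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()

    // On S3, rename() is implemented as copy-then-delete of every object,
    // which is why the commit cost scales with the amount of data written.
    spark.range(1000000L).toDF("id")
      .write.mode("overwrite")
      .parquet("s3a://my-bucket/output/ids")

    spark.stop()
  }
}
```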