 

Support for Parquet as an input / output format when working with S3

I've seen a number of questions describing problems when working with S3 in Spark:

  • Spark jobs finishes but application takes time to close
  • spark-1.4.1 saveAsTextFile to S3 is very slow on emr-4.0.0
  • Writing Spark checkpoints to S3 is too slow

many specifically describing issues with Parquet files:

  • Slow or incomplete saveAsParquetFile from EMR Spark to S3
  • Does Spark support Partition Pruning with Parquet Files
  • is Parquet predicate pushdown works on S3 using Spark non EMR?
  • Huge delays translating the DAG to tasks
  • Fast Parquet row count in Spark

as well as some external sources referring to other issues with Spark - S3 - Parquet combinations. It makes me think that either S3 with Spark, or the complete combination, may not be the best choice.

Am I onto something here? Can anyone provide an authoritative answer explaining:

  • The current state of Parquet support, with a focus on S3.
  • Can Spark (SQL) fully take advantage of Parquet features like partition pruning, predicate pushdown (including deeply nested schemas) and Parquet metadata? Do all of these features work as expected on S3 (or compatible storage solutions)?
  • Ongoing developments and open JIRA tickets.
  • Are there any configuration options we should be aware of when using these three together?
asked Oct 17 '22 by user7337271


1 Answer

A lot of the issues aren't Parquet specific, but come from the fact that S3 is not a filesystem, despite the APIs trying to make it look like one. Many nominally low-cost operations take multiple HTTPS requests, with the consequent delays.
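To make that cost model concrete, here is a minimal sketch (the bucket and paths are hypothetical) of how ordinary-looking filesystem calls translate into HTTPS requests when they go through Hadoop's s3a:// connector:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new URI("s3a://my-bucket/"), new Configuration())

// exists() issues HEAD/LIST requests, not a local stat():
fs.exists(new Path("s3a://my-bucket/data/part-00000.parquet"))

// listStatus() issues paged LIST requests:
fs.listStatus(new Path("s3a://my-bucket/data/"))

// rename() is emulated as a COPY of every object plus a DELETE, so its
// cost grows with the amount of data, unlike a real filesystem's rename:
fs.rename(new Path("s3a://my-bucket/tmp/out"),
          new Path("s3a://my-bucket/data/out"))
```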

Regarding JIRAs

  • HADOOP-11694: S3A phase II, everything you will get in Hadoop 2.8. Much of this is already in HDP 2.5, and yes, it has significant benefits.
  • HADOOP-13204: the todo list to follow.
  • Regarding Spark (and Hive), the use of rename() to commit work is a killer. It's used at the end of tasks and jobs, and in checkpointing. The more output you generate, the longer things take to complete. The s3guard work will include a zero-rename committer, but it will take care and time to move things to it; a sketch of the usual interim mitigations follows this list.
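Until that committer lands, the common interim mitigations are configuration-level. A minimal sketch, assuming Spark 2.x and Hadoop 2.7+-era settings (verify the exact keys against your release):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-on-s3a")
  // The v2 FileOutputCommitter renames task output directly into the
  // destination, skipping the second, job-level rename pass:
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  // Skip Parquet summary metadata files (_metadata, _common_metadata),
  // which cost extra S3 requests to write and are rarely needed:
  .config("spark.hadoop.parquet.enable.summary-metadata", "false")
  // Don't merge schemas across every file at read time:
  .config("spark.sql.parquet.mergeSchema", "false")
  .getOrCreate()
```

Note that the v2 committer only reduces the number of renames; each rename on S3 is still a copy-based operation, with the correctness caveats that implies.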

Parquet? Predicate pushdown works, but there are a few other options to speed things up. I list them and others in: http://www.slideshare.net/steve_l/apache-spark-and-object-stores
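For example, a minimal sketch (reusing the `spark` session from the snippet above; the dataset layout and column names are made up) showing where pruning and pushdown take effect:

```scala
import org.apache.spark.sql.functions.col

// Assumes data written partitioned by date: .../events/date=YYYY-MM-DD/*.parquet
val events = spark.read.parquet("s3a://my-bucket/events/")

val filtered = events
  .filter(col("date") === "2016-10-17") // partition pruning: only matching directories are scanned
  .filter(col("status") === 200)        // predicate pushdown: checked against Parquet row-group stats

// The physical plan lists PartitionFilters and PushedFilters when both apply:
filtered.explain(true)
```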

answered Oct 21 '22 by stevel