
Does Parquet predicate pushdown work on S3 when using Spark (non-EMR)?

Just wondering if Parquet predicate pushdown also works on S3, not only on HDFS, specifically when we use Spark (non-EMR).

Further explanation would be helpful, since the answer may require some understanding of distributed file systems.

asked Jan 21 '16 by rendybjunior


People also ask

Does Parquet support predicate pushdown?

Parquet allows for predicate pushdown filtering, a form of query pushdown, because the file footer stores row-group-level metadata for each column in the file.
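
As a minimal sketch of what that looks like in practice (Scala, spark-shell style; the path and column name are made up), writing a small Parquet file and then reading it back with a filter should show the predicate under PushedFilters in the scan node:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parquet-pushdown-check").getOrCreate()
    import spark.implicits._

    // Write some dummy data as Parquet (hypothetical local path).
    spark.range(1000000).toDF("id")
      .write.mode("overwrite").parquet("/tmp/pushdown-demo")

    // Read it back with a filter; the Parquet scan in the physical plan
    // should list the predicate under "PushedFilters".
    val filtered = spark.read.parquet("/tmp/pushdown-demo").filter($"id" > 999990)
    filtered.explain(true)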

How does predicate pushdown work in Spark?

Predicate pushdown filters the data in the database query itself, reducing the number of entries retrieved from the database and improving query performance. By default, the Spark Dataset API will automatically push down valid WHERE clauses to the database.
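
A hedged sketch of that default behaviour, assuming a hypothetical MySQL endpoint, table and credentials (none of these are real, and the matching JDBC driver must be on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("jdbc-pushdown").getOrCreate()

    // Hypothetical MySQL source; any JDBC-compatible database behaves the same way.
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/shop")
      .option("dbtable", "orders")
      .option("user", "reader")
      .option("password", "secret")
      .load()

    // A simple equality predicate is a valid WHERE-clause candidate, so Spark
    // sends it to the database instead of fetching the whole table; explain()
    // shows it under "PushedFilters" in the JDBC scan node.
    orders.filter("status = 'SHIPPED'").explain()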

Does Avro support predicate pushdown?

Predicate pushdown is a data processing technique that takes user-defined filters and executes them while reading the data. Apache Spark already supported it for Apache Parquet and RDBMS sources. Starting from Apache Spark 3.1.1, you can also use it for the Apache Avro, JSON and CSV formats.
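
A rough sketch for the Avro case, assuming Spark 3.1.1+ and the external spark-avro package on the classpath (for example, launched with --packages org.apache.spark:spark-avro_2.12:3.1.1); the path and column name are illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("avro-pushdown").getOrCreate()

    // Requires the external spark-avro module; "/data/events.avro" is a made-up path.
    val events = spark.read.format("avro").load("/data/events.avro")

    // On Spark 3.1.1+ this filter can be evaluated while reading the Avro data;
    // check the scan node in the plan to confirm it was pushed down.
    events.filter("event_type = 'click'").explain()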

Does Spark support Amazon S3?

With Amazon EMR release version 5.17.0 and later, you can use S3 Select with Spark on Amazon EMR.
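
Roughly like the following, though treat the details as assumptions to verify against the current EMR documentation: the "s3selectCSV" format name is from memory of the EMR docs, and the bucket/object is a placeholder.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("s3select-demo").getOrCreate()

    // EMR-specific data source (assumption: format name per the EMR docs);
    // "s3://my-bucket/data/records.csv" is a placeholder object.
    val rows = spark.read
      .format("s3selectCSV")
      .option("header", "true")
      .load("s3://my-bucket/data/records.csv")

    // With S3 Select, the filtering happens on the S3 side before the data
    // reaches the executors.
    rows.filter("state = 'CA'").show()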


1 Answer

I was wondering this myself, so I just tested it out. We use EMR clusters and Spark 1.6.1.

  • I generated some dummy data in Spark and saved it as a Parquet file both locally and on S3.
  • I created multiple Spark jobs with different kinds of filters and column selections. I ran these tests once against the local file and once against the S3 file.
  • I then used the Spark History Server to see how much data each job read as input (a rough sketch of this setup follows the list).
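
A rough sketch of that setup, using the current SparkSession API (the original test ran on Spark 1.6.1, which used SQLContext); the paths and bucket name are placeholders, and the s3a:// path assumes the hadoop-aws connector is configured:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("pushdown-test").getOrCreate()
    import spark.implicits._

    // 1. Generate some dummy data.
    val data = spark.range(0, 10000000L)
      .select($"id", ($"id" % 100).as("bucket"))

    // 2. Save it as Parquet both locally and on S3 (hypothetical locations).
    data.write.mode("overwrite").parquet("file:///tmp/pushdown-test")
    data.write.mode("overwrite").parquet("s3a://my-bucket/pushdown-test")

    // 3. Run the same filtered, column-pruned job against both copies, then
    //    compare each job's "Input" size in the Spark History Server.
    for (path <- Seq("file:///tmp/pushdown-test", "s3a://my-bucket/pushdown-test")) {
      val n = spark.read.parquet(path)
        .select("bucket")
        .filter($"bucket" === 7)
        .count()
      println(s"$path -> $n rows")
    }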

Results:

  • For the local Parquet file: the column selections and filters were pushed down to the read, since the input size dropped whenever the job contained filters or a column selection.
  • For the S3 Parquet file: the input size was always the same as for the job that processed all of the data. None of the filters or column selections were pushed down to the read; the Parquet file was always loaded completely from S3, even though the query plan (.queryExecution.executedPlan) claimed the filters had been pushed down (see the sketch after this list).
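
For that last point, this is one way to reproduce the mismatch the answer describes: print the executed plan and compare its PushedFilters claim with the input bytes reported by the Spark History Server. The path and column name are placeholders carried over from the sketch above.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("plan-check").getOrCreate()
    import spark.implicits._

    val df = spark.read.parquet("s3a://my-bucket/pushdown-test")
      .select("bucket")
      .filter($"bucket" === 7)

    // Look for "PushedFilters: [...]" in the Parquet scan node, then compare
    // that claim with the actual input bytes reported in the History Server.
    println(df.queryExecution.executedPlan)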

I will add more details about the tests and results when I have time.

answered Sep 20 '22 by user1355682