
Does Parquet predicate pushdown work on S3 when using Spark (non-EMR)?

Just wondering if Parquet predicate pushdown also works on S3, not only on HDFS, specifically when we use Spark (non-EMR).

Further explanation would be helpful, since the answer may require some understanding of distributed file systems.

asked Jan 21 '16 by rendybjunior


People also ask

Does Parquet support predicate pushdown?

Parquet allows for predicate pushdown filtering, a form of query pushdown, because the file footer stores row-group-level metadata for each column in the file.
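
As a minimal sketch of what that looks like in practice (Scala, spark-shell style; the path and column name are made up), writing a small Parquet file and then reading it back with a filter should show the predicate under PushedFilters in the scan node:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("parquet-pushdown-check").getOrCreate()
    import spark.implicits._

    // Write some dummy data as Parquet (hypothetical local path).
    spark.range(1000000).toDF("id")
      .write.mode("overwrite").parquet("/tmp/pushdown-demo")

    // Read it back with a filter; the Parquet scan in the physical plan
    // should list the predicate under "PushedFilters".
    val filtered = spark.read.parquet("/tmp/pushdown-demo").filter($"id" > 999990)
    filtered.explain(true)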

How does predicate pushdown work in Spark?

Predicate pushdown filters the data in the database query itself, reducing the number of entries retrieved from the database and improving query performance. By default, the Spark Dataset API will automatically push down valid WHERE clauses to the database.
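
A hedged sketch of that default behaviour, assuming a hypothetical MySQL endpoint, table and credentials (none of these are real, and the matching JDBC driver must be on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("jdbc-pushdown").getOrCreate()

    // Hypothetical MySQL source; any JDBC-compatible database behaves the same way.
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/shop")
      .option("dbtable", "orders")
      .option("user", "reader")
      .option("password", "secret")
      .load()

    // A simple equality predicate is a valid WHERE-clause candidate, so Spark
    // sends it to the database instead of fetching the whole table; explain()
    // shows it under "PushedFilters" in the JDBC scan node.
    orders.filter("status = 'SHIPPED'").explain()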

Does Avro support predicate pushdown?

Predicate pushdown is a data processing technique that takes user-defined filters and executes them while reading the data. Apache Spark already supported it for Apache Parquet and RDBMS sources. Starting from Apache Spark 3.1.1, you can also use it for the Apache Avro, JSON and CSV formats.
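
A rough sketch for the Avro case, assuming Spark 3.1.1+ and the external spark-avro package on the classpath (for example, launched with --packages org.apache.spark:spark-avro_2.12:3.1.1); the path and column name are illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("avro-pushdown").getOrCreate()

    // Requires the external spark-avro module; "/data/events.avro" is a made-up path.
    val events = spark.read.format("avro").load("/data/events.avro")

    // On Spark 3.1.1+ this filter can be evaluated while reading the Avro data;
    // check the scan node in the plan to confirm it was pushed down.
    events.filter("event_type = 'click'").explain()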

Does Spark support Amazon S3?

With Amazon EMR release version 5.17.0 and later, you can use S3 Select with Spark on Amazon EMR.
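
Roughly like the following, though treat the details as assumptions to verify against the current EMR documentation: the "s3selectCSV" format name is from memory of the EMR docs, and the bucket/object is a placeholder.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("s3select-demo").getOrCreate()

    // EMR-specific data source (assumption: format name per the EMR docs);
    // "s3://my-bucket/data/records.csv" is a placeholder object.
    val rows = spark.read
      .format("s3selectCSV")
      .option("header", "true")
      .load("s3://my-bucket/data/records.csv")

    // With S3 Select, the filtering happens on the S3 side before the data
    // reaches the executors.
    rows.filter("state = 'CA'").show()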


1 Answer

I was wondering this myself, so I just tested it out. We use EMR clusters and Spark 1.6.1.

  • I generated some dummy data in Spark and saved it as a Parquet file both locally and on S3.
  • I created multiple Spark jobs with different kinds of filters and column selections. I ran these tests once against the local file and once against the S3 file.
  • I then used the Spark History Server to see how much data each job read as input (a rough sketch of this setup follows the list).
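
A rough sketch of that setup, using the current SparkSession API (the original test ran on Spark 1.6.1, which used SQLContext); the paths and bucket name are placeholders, and the s3a:// path assumes the hadoop-aws connector is configured:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("pushdown-test").getOrCreate()
    import spark.implicits._

    // 1. Generate some dummy data.
    val data = spark.range(0, 10000000L)
      .select($"id", ($"id" % 100).as("bucket"))

    // 2. Save it as Parquet both locally and on S3 (hypothetical locations).
    data.write.mode("overwrite").parquet("file:///tmp/pushdown-test")
    data.write.mode("overwrite").parquet("s3a://my-bucket/pushdown-test")

    // 3. Run the same filtered, column-pruned job against both copies, then
    //    compare each job's "Input" size in the Spark History Server.
    for (path <- Seq("file:///tmp/pushdown-test", "s3a://my-bucket/pushdown-test")) {
      val n = spark.read.parquet(path)
        .select("bucket")
        .filter($"bucket" === 7)
        .count()
      println(s"$path -> $n rows")
    }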

Results:

  • For the local Parquet file: the column selections and filters were pushed down to the read, since the input size dropped whenever the job contained filters or a column selection.
  • For the S3 Parquet file: the input size was always the same as for the job that processed all of the data. None of the filters or column selections were pushed down to the read; the Parquet file was always loaded completely from S3, even though the query plan (.queryExecution.executedPlan) claimed the filters had been pushed down (see the sketch after this list).
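
For that last point, this is one way to reproduce the mismatch the answer describes: print the executed plan and compare its PushedFilters claim with the input bytes reported by the Spark History Server. The path and column name are placeholders carried over from the sketch above.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("plan-check").getOrCreate()
    import spark.implicits._

    val df = spark.read.parquet("s3a://my-bucket/pushdown-test")
      .select("bucket")
      .filter($"bucket" === 7)

    // Look for "PushedFilters: [...]" in the Parquet scan node, then compare
    // that claim with the actual input bytes reported in the History Server.
    println(df.queryExecution.executedPlan)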

I will add more details about the tests and results when I have time.

answered Sep 20 '22 by user1355682