One of the great benefits of the Parquet data storage format is that it's columnar. If I've got a 'wide' dataset with hundreds of columns, but my query only touches a few of them, then it's possible to read only the data that stores those few columns and skip the rest.
Presumably this feature works by reading a bit of metadata at the head of a Parquet file that indicates the locations on the filesystem for each column. The reader can then seek on disk to read in only the necessary columns.
Does anyone know whether Spark's default Parquet reader correctly implements this kind of selective seeking on S3? I think it's supported by S3, but there's a big difference between theoretical support and an implementation that properly exploits that support.
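For concreteness, here is a minimal sketch of the kind of column-pruned read being asked about; the bucket path and column names are made up, and a spark-shell session named `spark` is assumed:

```scala
// Hypothetical wide Parquet table on S3; only two of its many columns are requested.
val df = spark.read
  .parquet("s3a://my-bucket/wide-table/")
  .select("user_id", "event_time")

// The physical plan's FileScan node lists the pruned ReadSchema,
// i.e. only the selected columns are requested from the Parquet files.
df.explain()
```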
Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
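As a small illustration of that round trip (the path and column names here are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-roundtrip").getOrCreate()
import spark.implicits._

// Write a small DataFrame; the schema (column names and types) travels with the files.
val people = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")
people.write.mode("overwrite").parquet("/tmp/people.parquet")

// Read it back: the schema is recovered from the Parquet metadata,
// with the columns reported as nullable.
val readBack = spark.read.parquet("/tmp/people.parquet")
readBack.printSchema()
```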
Parquet typically gives faster query execution than other common file formats such as Avro or JSON, and it also consumes less disk space than either.
This needs to be broken down:

1. Does the Parquet reader attempt to selectively read only the needed columns, using Hadoop FileSystem seek() + read() or readFully(position, buffer, length) calls? Yes.
2. Does the S3A connector handle that random-access pattern efficiently? On Hadoop 2.8+, yes, provided you set spark.hadoop.fs.s3a.experimental.fadvise=random to trigger random access.
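One way to set that property, sketched here via the SparkSession builder (a --conf flag on spark-submit works just as well):

```scala
import org.apache.spark.sql.SparkSession

// Enable S3A random-access IO for columnar formats (effective on Hadoop 2.8+).
val spark = SparkSession.builder()
  .appName("parquet-on-s3a")
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
  .getOrCreate()
```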
Hadoop 2.7 and earlier handle the aggressive seek()s around the file badly, because they always initiate a GET from the current offset to the end of the file, get surprised by the next seek, have to abort that connection, and reopen a new TCP/HTTPS 1.1 connection (slow, CPU-heavy), again and again. The random IO mode hurts bulk loading of things like .csv.gz, but it is critical to getting ORC/Parquet performance.
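To make that access pattern concrete, here is a rough sketch of the kind of positioned reads a columnar reader issues through the Hadoop FileSystem API; the path, offset, and length are invented for illustration (a real Parquet reader takes them from the file's own metadata):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs   = FileSystem.get(new URI("s3a://my-bucket/"), conf)
val in   = fs.open(new Path("s3a://my-bucket/wide-table/part-00000.parquet"))

// Hypothetical location of one column chunk; real readers get this from the file metadata.
val columnChunkOffset = 4L * 1024 * 1024
val columnChunkLength = 256 * 1024
val buffer = new Array[Byte](columnChunkLength)

// Positioned read of just that byte range. With fadvise=random on Hadoop 2.8+ this avoids
// the open-to-end-of-file GET; on Hadoop 2.7 each such read ends up aborting the GET and
// reopening a connection, which is exactly the cost described above.
in.readFully(columnChunkOffset, buffer, 0, columnChunkLength)

in.close()
```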
You don't get the speedup with Hadoop 2.7's hadoop-aws JAR. If you need it, you have to update hadoop*.jar and its dependencies, or build Spark from scratch against Hadoop 2.8.
Note that Hadoop 2.8+ also has a nice little feature: if you call toString() on an S3A filesystem client in a log statement, it prints out all the filesystem IO stats, including how much data was discarded in seeks, aborted TCP connections, etc. That helps you work out what's going on.
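For example, assuming a live SparkSession named `spark` (the bucket name is made up):

```scala
import java.net.URI
import org.apache.hadoop.fs.FileSystem

// On Hadoop 2.8+, the S3A client's toString() includes its IO statistics
// (bytes read, bytes discarded in seeks, aborted connections, ...).
val fs = FileSystem.get(
  new URI("s3a://my-bucket/"),
  spark.sparkContext.hadoopConfiguration)
println(s"S3A filesystem stats: $fs")
```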
Warning (2018-04-13): do not try to drop the Hadoop 2.8+ hadoop-aws JAR onto the classpath along with the rest of the Hadoop 2.7 JAR set and expect to see any speedup. All you will see are stack traces. You need to update all the Hadoop JARs and their transitive dependencies.