 

Spark - Reading many small parquet files gets status of each file beforehand

I have hundreds of thousands of smaller parquet files I'm attempting to read in with Spark on a regular basis. My application runs, but before the files are read in by the executor nodes, the driver node appears to be getting the status of each individual file. I read into it a bit, and this is necessary to infer the schema and partitions. I tried providing them like so:

import org.apache.spark.sql.execution.datasources.DataSource

sparkSession.baseRelationToDataFrame(
  DataSource
    .apply(
      sparkSession,
      paths = paths, // list of thousands of parquet file paths in S3
      partitionColumns = Seq("my_join_column"),
      userSpecifiedSchema = Some(schema),
      className = "parquet",
      options = Map.empty[String, String]
    )
    // Skip the per-file existence check, but resolveRelation still builds an
    // InMemoryFileIndex, which lists/stats every file.
    .resolveRelation(checkFilesExist = false)
)

But even when providing the schema and partition columns, it still takes a while beforehand. After looking into the resolveRelation code a bit, it looks like it still has to query the status of each file in order to build an InMemoryFileIndex.

Is there any way to get around this issue?

I'm using spark-sql 2.3.1.

Asked Nov 02 '18 by Sam


1 Answer

There is no good way to avoid this problem in the current Spark architecture.

A while back I collaborated with some Spark committers on a LazyBaseRelation design that can delay discovering file information until the number of partitions of a data source (as opposed to just its schema) must be known, which isn't technically necessary until an action has to run. We never completed that work, and even then, when the time came to execute an action, you'd take the hit.

There are four practical approaches to speeding the initial file discovery:

  1. Use a large cluster, as some aspects of file discovery are distributed (a config sketch follows this list). In some environments you can scale the cluster down once discovery is complete.
  2. Do the initial discovery before you need to use the data so that, hopefully, it's available by the time it's needed. We have petabytes of data in millions of large Parquet files with three levels of partitioning; we use a scheduled job to refresh the in-memory file index.
  3. If you are on Databricks, use Delta's OPTIMIZE to coalesce the small Parquet files into fewer, larger ones. Note that Delta costs extra.
  4. Implement the equivalent of OPTIMIZE yourself by rewriting subsets of the data (a sketch follows below). Whether you can do this easily or not depends on access patterns: you have to think about idempotence and consistency.
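
A note on (1): the distributed part of file listing is governed by a couple of Spark SQL settings. This is only a minimal sketch of tuning them when building the session; the numbers shown are the Spark 2.3 defaults and are illustrative, not recommendations:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bulk-parquet-read")
  // Above this many input paths, Spark lists files with a distributed job
  // on the executors instead of serially on the driver (default 32).
  .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
  // Upper bound on the parallelism of that listing job (default 10000).
  .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "10000")
  .getOrCreate()

With thousands of explicit S3 paths you are already well above the threshold, so what actually helps is having more executors available during the listing phase.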
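
And for (4), a roll-your-own compaction job is essentially a read-repartition-write pass over a slice of the data. A rough sketch, assuming hypothetical S3 prefixes and a quiet window in which nothing else writes to that slice; idempotence and the final swap are up to you:

// Hypothetical locations; adjust to your layout.
val srcPath     = "s3://my-bucket/events/my_join_column=2018-11-02/"
val stagingPath = "s3://my-bucket/events_compacted/my_join_column=2018-11-02/"

spark.read
  .schema(schema)      // reuse the known schema to avoid inference over many files
  .parquet(srcPath)
  .repartition(8)      // aim for a handful of large output files
  .write
  .mode("overwrite")
  .parquet(stagingPath)

// Validate the staging output, then swap it in (repoint the partition's
// location, or replace srcPath out of band) so readers never see a partial state.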

Once initial discovery is done, caching the in-memory file list is your best friend. There are two ways of doing it:

  • Use the metastore, by registering your data as an external table (a DDL sketch follows after this list). Whether you can do this easily or not depends on data update patterns. If the data is naturally partitioned, you can add partitions using DDL and easily implement strategy (4) above.

  • Build your own table manager. This is what we did, as the metastore implementation had unacceptable restrictions on schema evolution. You'd have to decide on scoping: driver/JVM-scoped and SparkSession-scoped are the two obvious choices.
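
To make the first bullet concrete, here is a rough Spark SQL sketch of registering the data as a partitioned external table and maintaining its partitions with DDL; the table name, columns, and S3 locations are made up for illustration:

// Register the existing layout as a partitioned external table
// (the partition column must appear in the column list).
spark.sql("""
  CREATE TABLE IF NOT EXISTS events (id BIGINT, payload STRING, my_join_column STRING)
  USING parquet
  PARTITIONED BY (my_join_column)
  LOCATION 's3://my-bucket/events/'
""")

// One-shot discovery of everything already on storage...
spark.sql("MSCK REPAIR TABLE events")

// ...or register new partitions incrementally as they land.
spark.sql("""
  ALTER TABLE events ADD IF NOT EXISTS
  PARTITION (my_join_column = '2018-11-02')
  LOCATION 's3://my-bucket/events/my_join_column=2018-11-02/'
""")

// If files under an existing partition change, invalidate cached metadata.
spark.catalog.refreshTable("events")

// Reads now get the file list from the metastore instead of re-scanning S3.
val events = spark.table("events").where("my_join_column = '2018-11-02'")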

Good luck!

Answered Oct 13 '22 by Sim