I have parquet files stored in partitions by date in directories like:
/activity
/date=20180802
I'm using Spark 2.2 and there are 400+ partitions. My understanding is that predicate pushdown should allow me to run a query like the one below and get quick results.
spark.read.parquet(".../activity")
.filter($"date" === "20180802" && $"id" === "58ff800af2")
.show()
However, the query above is taking around 90 seconds while the query below takes around 5 seconds. Am I doing something wrong or is this expected behavior?
spark.read.parquet(".../activity/date=20180802")
.filter($"id" === "58ff800af2")
.show()
I noticed this too and talked about it in a Spark Summit presentation.
Spark performs an expensive file listing operation before it can prune partitions, and that listing can dominate the query time. Spark is surprisingly slow at listing files: I've compared Spark's file listing times with the AWS CLI and don't know why Spark takes so much longer to list the same files.
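If you want to make the same comparison yourself, a rough baseline is to time a recursive listing with the AWS CLI (the bucket and prefix below are placeholders, not from the question):

```shell
# Rough baseline: how long does a plain recursive listing of the
# partitioned directory take outside of Spark?
time aws s3 ls --recursive s3://my-bucket/activity/ | wc -l
```

Comparing that wall-clock time against the delay before Spark's first stage starts gives a sense of how much of the 90 seconds is listing overhead.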
You should rephrase "My understanding is that predicate pushdown..." to "My understanding is that partition filters...". Predicate pushdown is a different mechanism: partition filters prune entire directories before any data is read, whereas pushed predicates are applied inside the Parquet files that do get read.
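You can see the distinction in the physical plan. A sketch, reusing the query from the question (the exact plan text varies by Spark version):

```scala
// Inspect the physical plan to see which filters prune partitions
// and which are pushed down into the Parquet reader.
val df = spark.read.parquet(".../activity")
  .filter($"date" === "20180802" && $"id" === "58ff800af2")

df.explain()
// The FileScan node reports the two filter kinds separately,
// roughly along these lines:
//   PartitionFilters: [... (date = 20180802) ...]   <- prunes directories
//   PushedFilters:    [... EqualTo(id,58ff800af2)]  <- applied inside files
```

Even when both filters show up correctly, the file listing still has to happen before the partition filter can prune anything, which is why the first query is slow.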
This is also an issue with Delta lakes. It's actually worse with Delta lakes because the workaround you mentioned (reading the partition directory directly) doesn't work with Delta.
In short, you're not doing anything wrong and this is the expected behavior. You only have 400 partitions, so the unnecessary file listing isn't so bad in your case. Imagine how slow this gets when you have 20,000 partitions!
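One refinement to the workaround in your second query: reading the partition directory directly drops the date column from the result. Setting basePath (a standard DataFrameReader option) keeps it available. A sketch:

```scala
// Read a single partition directory (skipping the full listing)
// while keeping `date` as a column via basePath.
spark.read
  .option("basePath", ".../activity")
  .parquet(".../activity/date=20180802")
  .filter($"id" === "58ff800af2")
  .show()
```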