
Spark predicate pushdown performance

I have parquet files stored in partitions by date in directories like:

/activity
    /date=20180802

I'm using Spark 2.2 and there are 400+ partitions. My understanding is that predicate pushdown should allow me to run a query like the one below and get quick results.

spark.read.parquet(".../activity")
    .filter($"date" === "20180802" && $"id" === "58ff800af2")
    .show()

However, the query above is taking around 90 seconds while the query below takes around 5 seconds. Am I doing something wrong or is this expected behavior?

spark.read.parquet(".../activity/date=20180802")
    .filter($"id" === "58ff800af2")
    .show()
asked Aug 15 '18 by Duke Silver
1 Answer

I noticed this too and talked about it at a Spark Summit presentation.

Spark performs an expensive file listing operation over the partition directories before it can prune anything, and that listing can really slow things down. Spark is really bad at listing files. I've compared Spark's file listing times with the AWS CLI and I don't know why it takes Spark so long to list files.

You should rephrase "My understanding is that predicate pushdown..." to "My understanding is that partition filtering...". Predicate pushdown is a different kind of filtering, applied inside the Parquet files rather than at the directory level.
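The distinction is visible in the physical plan. A hypothetical sketch (local temp directory and sample rows are made up for illustration) where `explain()` reports the two filter kinds separately on the Parquet scan node:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("planDemo").getOrCreate()
import spark.implicits._

// Recreate a small partitioned layout like /activity/date=20180802
val dir = java.nio.file.Files.createTempDirectory("activity").toString
Seq("58ff800af2", "aabbccddee").toDF("id")
  .write.parquet(s"$dir/date=20180802")

val q = spark.read.parquet(dir)
  .filter($"date" === "20180802" && $"id" === "58ff800af2")

// The scan node lists both kinds of filters:
//   PartitionFilters - directory-level pruning on the `date` partition column
//   PushedFilters    - predicate pushdown on `id`, evaluated inside the Parquet files
q.explain()
```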

This is also an issue with Delta data lakes. It's actually worse with Delta because the workaround you mentioned, reading the partition directory directly, doesn't work with Delta.

In short, you're not doing anything wrong and this is the expected behavior. You only have 400 partitions, so the unnecessary file listing isn't so bad in your case. Imagine how slow this gets when you have 20,000 partitions!

answered Nov 12 '22 by Powers