I have parquet files stored in partitions by date in directories like:
/activity
/date=20180802
I'm using Spark 2.2 and there are 400+ partitions. My understanding is that predicate pushdown should allow me to run a query like the one below and get quick results.
spark.read.parquet(".../activity")
.filter($"date" === "20180802" && $"id" === "58ff800af2")
.show()
However, the query above is taking around 90 seconds while the query below takes around 5 seconds. Am I doing something wrong or is this expected behavior?
spark.read.parquet(".../activity/date=20180802")
.filter($"id" === "58ff800af2")
.show()
I noticed this too and talked about it in a Spark Summit presentation.
Spark performs an expensive file listing operation before it can prune partitions, and that listing can dominate the query time. Spark is surprisingly slow at listing files: I've compared Spark's file listing times with the AWS CLI and don't know why Spark takes so much longer to list the same files.
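If you want to make the same comparison yourself, a rough baseline is to time a recursive listing with the AWS CLI (the bucket and prefix below are placeholders, not from the question):

```shell
# Rough baseline: how long does a plain recursive listing of the
# partitioned directory take outside of Spark?
time aws s3 ls --recursive s3://my-bucket/activity/ | wc -l
```

Comparing that wall-clock time against the delay before Spark's first stage starts gives a sense of how much of the 90 seconds is listing overhead.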
You should rephrase "My understanding is that predicate pushdown..." to "My understanding is that partition filters...". Predicate pushdown is a different mechanism: partition filters prune entire directories before any data is read, whereas pushed predicates are applied inside the Parquet files that do get read.
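You can see the distinction in the physical plan. A sketch, reusing the query from the question (the exact plan text varies by Spark version):

```scala
// Inspect the physical plan to see which filters prune partitions
// and which are pushed down into the Parquet reader.
val df = spark.read.parquet(".../activity")
  .filter($"date" === "20180802" && $"id" === "58ff800af2")

df.explain()
// The FileScan node reports the two filter kinds separately,
// roughly along these lines:
//   PartitionFilters: [... (date = 20180802) ...]   <- prunes directories
//   PushedFilters:    [... EqualTo(id,58ff800af2)]  <- applied inside files
```

Even when both filters show up correctly, the file listing still has to happen before the partition filter can prune anything, which is why the first query is slow.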
This is also an issue with Delta lakes. It's actually worse with Delta lakes because the workaround you mentioned (reading the partition directory directly) doesn't work with Delta.
In short, you're not doing anything wrong and this is the expected behavior. You only have 400 partitions, so the unnecessary file listing isn't so bad in your case. Imagine how slow this gets when you have 20,000 partitions!
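One refinement to the workaround in your second query: reading the partition directory directly drops the date column from the result. Setting basePath (a standard DataFrameReader option) keeps it available. A sketch:

```scala
// Read a single partition directory (skipping the full listing)
// while keeping `date` as a column via basePath.
spark.read
  .option("basePath", ".../activity")
  .parquet(".../activity/date=20180802")
  .filter($"id" === "58ff800af2")
  .show()
```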