The situation where you want to execute a Spark SQL query on only partition columns comes up pretty often. For example, let's say you want to programmatically get the latest date from a table (where date is a partition column). Conceptually, you can do something like this:
import spark.implicits._  // enables the $"col" syntax (already in scope in spark-shell)

val myData = spark.table("chrisa.my_table")
val latestDate = myData.select($"date").distinct
  .orderBy($"date".desc)
  .limit(1).collect
I'm assuming that "date" is a column you can order by; perhaps it's a string formatted as yyyyMMdd, such as 20220908.
Sometimes I've seen code similar to this be very slow, i.e. it's clearly reading the underlying file data, and sometimes I've seen it be nearly instant, i.e. it's probably only accessing table metadata about the partition columns without reading the files.
Questions are: What are the conditions under which this won't have to read the underlying full row data? Specifically,
I'm using Spark 2.4.2
After a little more digging I found these two JIRAs answering my question exactly.
It seems that the first JIRA was closed in favor of the second because the second describes the problem better.
The summary from these two seems to be that there used to be a config parameter, spark.sql.optimizer.metadataOnly, that allowed what I wanted to do, but it can lead to incorrect results if the underlying files have zero rows. So the parameter and the whole concept were deprecated and removed to favor correctness over speed.
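If all you need is the list of partition values, one way to stay on the metadata side is to ask the catalog directly with SHOW PARTITIONS. The following is only a sketch, assuming the table's partitions are registered in the metastore (otherwise Spark raises an error for this command) and that "date" values are fixed-width strings that sort lexicographically. The object name LatestPartition and the helpers partitionValue and latestDate are hypothetical names made up for illustration. Note that it has the same caveat as the old optimization: a partition whose files contain zero rows is still reported.

```scala
import org.apache.spark.sql.SparkSession

object LatestPartition {
  // Pulls the value of `column` out of a partition spec such as
  // "date=20220908" or "date=20220908/hour=01". Hypothetical helper.
  def partitionValue(spec: String, column: String): Option[String] =
    spec.split("/").iterator
      .map(_.split("=", 2))
      .collectFirst { case Array(k, v) if k == column => v }

  // Reads partition specs from the catalog only; no data files are scanned.
  // Assumes `table` is a partitioned table whose partitions are tracked
  // in the metastore.
  def latestDate(spark: SparkSession, table: String): Option[String] = {
    val specs = spark.sql(s"SHOW PARTITIONS $table").collect().map(_.getString(0))
    val dates = specs.flatMap(partitionValue(_, "date"))
    if (dates.isEmpty) None else Some(dates.max)
  }
}
```

With the table from the question this would be called as LatestPartition.latestDate(spark, "chrisa.my_table"). Partition values with special characters may come back escaped in the spec string, which this sketch does not handle.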
Even my example in the original post shows a case that could give two different results if the underlying file data had zero rows:
val myData = spark.table("chrisa.my_table")
val latestDate = myData.select($"date").distinct
  .orderBy($"date".desc)
  .limit(1).collect
For example, if the latest date for which there was metadata had zero rows, then a full scan would yield the next latest date (the one that actually had rows), but the metadata-only version would yield the latest date that had metadata.
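To make the discrepancy concrete, here is a tiny plain-Scala model of it with no Spark involved. The partition values and row counts are invented for illustration, and MetadataOnlyDemo is a hypothetical name; the map simply stands in for "partitions known to the metastore" and "rows actually stored in their files".

```scala
object MetadataOnlyDemo {
  // Hypothetical model: partition value -> number of rows in its files.
  val rowsPerPartition: Map[String, Int] = Map(
    "20220906" -> 120,
    "20220907" -> 95,
    "20220908" -> 0 // partition is registered, but its files are empty
  )

  // Metadata-only answer: every registered partition counts.
  val latestFromMetadata: String = rowsPerPartition.keys.max

  // Full-scan answer: only partitions that contribute at least one row count.
  val latestFromScan: String =
    rowsPerPartition.collect { case (date, n) if n > 0 => date }.max

  def main(args: Array[String]): Unit = {
    println(s"metadata only: $latestFromMetadata") // 20220908
    println(s"full scan:     $latestFromScan")     // 20220907
  }
}
```

The two answers differ only because of the empty 20220908 partition, which is exactly the case the JIRAs describe.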