Spark SQL on partition columns without reading full row data

Question

The situation where you want to execute a Spark SQL query on only partition columns comes up pretty often. For example, let's say you want to programmatically get the latest date from a table (where date is a partition column). Conceptually, you can do something like this:

val myData = spark.table("chrisa.my_table")
val latestDate = myData.select($"date").distinct
                     .orderBy($"date".desc)
                     .limit(1).collect

I'm assuming that "date" is a column you can order by, perhaps it's a string formatted by "YYYYmmDD" such as 20220908

Sometimes I've seen code similar to this be very slow, i.e. it's clearly reading the underlying file data and sometimes I've seen it be nearly instant, i.e. probably only accessing table metadata about the partition columns without reading the files.

Questions are: What are the conditions under which this won't have to read the underlying full row data? Specifically,

What has to be true about the storage format or metadata?
Which transformations will allow this to work without reading the full row data?

I'm using Spark 2.4.2

Chris A. · Accepted Answer

After a little more digging I found these two JIRAs answering my question exactly.

https://issues.apache.org/jira/browse/SPARK-12890
https://issues.apache.org/jira/browse/SPARK-34194

It seems that the first JIRA was closed in favor of the second because it describes the problem better.

The summary from these two seems to be that there used to be config parameter spark.sql.optimizer.metadataOnly that used to allow what I wanted to do, but this can lead to incorrect results if the underlying files have zero rows. So the parameter and the whole concept was deprecated and removed to favor correctness over speed.

Even my example in the original post shows a case that could give two different results if the underlying file data had zero rows:

val myData = spark.table("chrisa.my_table")
val latestDate = myData.select($"date").distinct
                     .orderBy($"date".desc)
                     .limit(1).collect

For example, if the latest date for which there was metadata had zero rows then a full scan would yield the next latest date (which actually had rows), but the metadata only version would yield a the latest date that had metadata.

Spark SQL on partition columns without reading full row data

Tags:

scala

apache-spark

apache-spark-sql

Chris A.

1 Answers

Chris A.

Recent Activity

Donate For Us

Spark SQL on partition columns without reading full row data

Tags:

scala

apache-spark

apache-spark-sql

Chris A.

1 Answers

Chris A.

Related questions

Recent Activity

Donate For Us