I want to find the last value of y
over an ordered partition using a query like this:
SELECT
x,
LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC)
FROM table
But LAST_VALUE
returns lots of values that aren't the last value (in this case, the largest value) of y
for a given partition. Why?
(In this case, MAX
can be used instead of LAST_VALUE
to find the largest value, but why doesn't LAST_VALUE
return the max value too?)
TLDR: The query you want is:
SELECT
x,
LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM table
Possibly followed by GROUP BY
to collapse duplicate output rows from the analytic function.
And of course, it's simpler to just use MAX
over an unordered partition if that's all you need:
SELECT
x,
MAX(y) OVER (PARTITION BY x)
FROM table
Before answering this question, here's a little background on analytic functions (a.k.a. window functions). All of the below is standard SQL and not specific to BigQuery.
First, analytic functions are not aggregation functions. Whereas aggregation functions collapse multiple input rows into a single output row, analytic functions compute exactly one output row for every input row. So you need to make sure you're thinking about what the output is for every input row.
Second, analytic functions operate over a "window" of rows that is a subset of the "partition" to which the row belongs. The partition for an input row is determined by the PARTITION BY
clause, or you can omit it if you want the partition to be the entire set of input rows. The window is given by the ROWS
clause, but if you don't specify it (and users usually don't), it defaults to either the entire partition (when no ordering is applied) or the set of rows in the partition from the first row to the current row (when an ORDER BY
is present). Note that the window can differ for each input row in a partition!
Now, back to LAST_VALUE
. Although the default window described above is reasonable in many cases (e.g., computing cumulative sums), it works spectacularly poorly with LAST_VALUE
. The LAST_VALUE
function returns the value of the last row in the window, and by default the last row in the window is the current row.
So to fix the problem, you need to explicitly specify that the window for LAST_VALUE
is the entire partition, not just the rows up to the current row. You can do so as follows:
SELECT x, LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM table
To test this out, here's an example:
SELECT
x,
FIRST_VALUE(x) OVER (ORDER BY x ASC) first_asc,
FIRST_VALUE(x) OVER (ORDER BY x DESC) first_desc,
LAST_VALUE(x) OVER (ORDER BY x ASC) last_asc,
LAST_VALUE(x) OVER (ORDER BY x DESC) last_desc,
FROM
(SELECT 4 as x),
(SELECT 2 as x),
(SELECT 1 as x),
(SELECT 3 as x)
x,first_asc,first_desc,last_asc,last_desc
1,1,4,1,1
2,1,4,2,2
3,1,4,3,3
4,1,4,4,4
Note that LAST_VALUE
returns 1, 2, 3, 4 instead of just 4 because the window changes for each input row.
Now let's specify a window that is the entire partition:
SELECT
x,
FIRST_VALUE(x) OVER (ORDER BY x ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) first_asc,
FIRST_VALUE(x) OVER (ORDER BY x DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) first_desc,
LAST_VALUE(x) OVER (ORDER BY x ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) last_asc,
LAST_VALUE(x) OVER (ORDER BY x DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) last_desc,
FROM
(SELECT 4 as x),
(SELECT 2 as x),
(SELECT 1 as x),
(SELECT 3 as x)
x,first_asc,first_desc,last_asc,last_desc
1,1,4,4,1
2,1,4,4,1
3,1,4,4,1
4,1,4,4,1
Now we get 4 for LAST_VALUE
as expected.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With