Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why doesn't LAST_VALUE return the last value?

I want to find the last value of y over an ordered partition using a query like this:

SELECT
  x,
  LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC)
FROM table

But LAST_VALUE returns lots of values that aren't the last value (in this case, the largest value) of y for a given partition. Why?

(In this case, MAX can be used instead of LAST_VALUE to find the largest value, but why doesn't LAST_VALUE return the max value too?)

like image 618
Jeremy Condit Avatar asked Jan 30 '16 00:01

Jeremy Condit


1 Answers

TLDR: The query you want is:

SELECT
  x,
  LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM table

Possibly followed by GROUP BY to collapse duplicate output rows from the analytic function.

And of course, it's simpler to just use MAX over an unordered partition if that's all you need:

SELECT
  x,
  MAX(y) OVER (PARTITION BY x)
FROM table

Before answering this question, here's a little background on analytic functions (a.k.a. window functions). All of the below is standard SQL and not specific to BigQuery.

First, analytic functions are not aggregation functions. Whereas aggregation functions collapse multiple input rows into a single output row, analytic functions compute exactly one output row for every input row. So you need to make sure you're thinking about what the output is for every input row.

Second, analytic functions operate over a "window" of rows that is a subset of the "partition" to which the row belongs. The partition for an input row is determined by the PARTITION BY clause, or you can omit it if you want the partition to be the entire set of input rows. The window is given by the ROWS clause, but if you don't specify it (and users usually don't), it defaults to either the entire partition (when no ordering is applied) or the set of rows in the partition from the first row to the current row (when an ORDER BY is present). Note that the window can differ for each input row in a partition!

Now, back to LAST_VALUE. Although the default window described above is reasonable in many cases (e.g., computing cumulative sums), it works spectacularly poorly with LAST_VALUE. The LAST_VALUE function returns the value of the last row in the window, and by default the last row in the window is the current row.

So to fix the problem, you need to explicitly specify that the window for LAST_VALUE is the entire partition, not just the rows up to the current row. You can do so as follows:

SELECT x, LAST_VALUE(y) OVER (PARTITION BY x ORDER BY y ASC
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM table

To test this out, here's an example:

SELECT
  x,
  FIRST_VALUE(x) OVER (ORDER BY x ASC) first_asc,
  FIRST_VALUE(x) OVER (ORDER BY x DESC) first_desc,
  LAST_VALUE(x) OVER (ORDER BY x ASC) last_asc,
  LAST_VALUE(x) OVER (ORDER BY x DESC) last_desc,
FROM
  (SELECT 4 as x),
  (SELECT 2 as x),
  (SELECT 1 as x),
  (SELECT 3 as x)

x,first_asc,first_desc,last_asc,last_desc
1,1,4,1,1
2,1,4,2,2
3,1,4,3,3
4,1,4,4,4

Note that LAST_VALUE returns 1, 2, 3, 4 instead of just 4 because the window changes for each input row.

Now let's specify a window that is the entire partition:

SELECT
  x,
  FIRST_VALUE(x) OVER (ORDER BY x ASC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) first_asc,
  FIRST_VALUE(x) OVER (ORDER BY x DESC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) first_desc,
  LAST_VALUE(x) OVER (ORDER BY x ASC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) last_asc,
  LAST_VALUE(x) OVER (ORDER BY x DESC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) last_desc,
FROM
  (SELECT 4 as x),
  (SELECT 2 as x),
  (SELECT 1 as x),
  (SELECT 3 as x)

x,first_asc,first_desc,last_asc,last_desc
1,1,4,4,1
2,1,4,4,1
3,1,4,4,1
4,1,4,4,1

Now we get 4 for LAST_VALUE as expected.

like image 182
Jeremy Condit Avatar answered Sep 29 '22 11:09

Jeremy Condit