How to ignore nulls in BigQuery using LAG()?

When using LAG() (in BigQuery standard SQL), how do you skip NULL values so that it takes the first preceding value which is not NULL?

I've prepared some sample rows in the same format as the source table, but obfuscated. In the example, the query only works for rows that are not preceded by a NULL value. Specifically, rows 3 & 4 should be assigned '2017-01-25 04:02:36' (as was the case for row 5), but they are NULL.

This makes sense, since LAG() with the default offset only looks at the immediately preceding row. However, surely there is an easy way to specify something like IGNORE NULLS?

--TEMP
with example as (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 03:19:50') as col_c, 'val_1' as col_d 
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 03:19:50') as col_c, 'val_2' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 03:19:50') as col_c, 'val_3' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:01:23') as col_c, 'val_1' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:01:23') as col_c, 'val_2' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:01:23') as col_c, 'val_3' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:01:59') as col_c, 'val_1' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:01:59') as col_c, 'val_2' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:01:59') as col_c, 'val_3' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:02:36') as col_c, 'val_1' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:02:36') as col_c, 'val_2' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:02:36') as col_c, 'val_3' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:02:55') as col_c, 'val_1' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:02:55') as col_c, 'val_3' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 07:16:58') as col_c, 'val_1' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 07:16:58') as col_c, 'val_3' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 09:35:39') as col_c, 'val_1' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 09:35:39') as col_c, 'val_3' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 09:46:48') as col_c, 'val_1' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 09:46:48') as col_c, 'val_2' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 09:46:48') as col_c, 'val_3' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 10:47:48') as col_c, 'val_2' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 10:47:48') as col_c, 'val_3' as col_d))
--TEMP
SELECT col_a, col_b, col_c,
  case when val_1_transposed is null then LAG(val_1_transposed) over (order by col_c) else val_1_transposed end as val_1_transposed,
  case when val_2_transposed is null then LAG(val_2_transposed) over (order by col_c) else val_2_transposed end as val_2_transposed,
  case when val_3_transposed is null then LAG(val_3_transposed) over (order by col_c) else val_3_transposed end as val_3_transposed
FROM (
  SELECT col_a, col_b, col_c,
    MAX(IF(col_d = 'val_1', col_c, NULL)) AS val_1_transposed,
    MAX(IF(col_d = 'val_2', col_c, NULL)) AS val_2_transposed,
    MAX(IF(col_d = 'val_3', col_c, NULL)) AS val_3_transposed
  FROM (
    SELECT col_a, col_b, col_c, col_d FROM example) GROUP BY 1,2,3) ORDER BY col_c DESC

asked Apr 24 '17 10:04 by Graham Polley


1 Answer

There are two solutions, described in detail here: http://sqlmag.com/t-sql/last-non-null-puzzle

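I adapted one of the solutions. It uses a MAX window aggregate function to return, for each row, the maximum relevant key seen so far. With ROWS UNBOUNDED PRECEDING the running MAX only reaches a new level when a row with a non-NULL value appears, so it can serve as a group key (grp). The non-NULL value of each group is then carried over to replace the NULL entries that LAG would return.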
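
The grouping idea can be sketched on a smaller table first. This is only a minimal illustration, not the original query: the table t and the columns ts, val, grp and val_filled are hypothetical names, and ts is assumed to be a unique ordering key.

```sql
-- Hypothetical four-row table: ts orders the rows, val has gaps.
WITH t AS (
  SELECT 1 AS ts, 'a' AS val UNION ALL
  SELECT 2, CAST(NULL AS STRING) UNION ALL
  SELECT 3, CAST(NULL AS STRING) UNION ALL
  SELECT 4, 'b'
),
grouped AS (
  SELECT ts, val,
    -- ts of the most recent row (up to and including this one) whose val is not NULL
    MAX(IF(val IS NOT NULL, ts, NULL)) OVER (ORDER BY ts ROWS UNBOUNDED PRECEDING) AS grp
  FROM t
)
SELECT ts, val,
  -- each grp contains exactly one non-NULL val, so MAX spreads it over the NULL rows that follow
  MAX(val) OVER (PARTITION BY grp) AS val_filled
FROM grouped
ORDER BY ts
-- val_filled for ts 1..4: a, a, a, b
```

Rows that come before the first non-NULL value end up with grp = NULL and therefore stay NULL. The full query for the sample data in the question follows.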

--TEMP
with example as (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 03:19:50') as col_c, 'val_1' as col_d 
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 03:19:50') as col_c, 'val_2' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 03:19:50') as col_c, 'val_3' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:01:23') as col_c, 'val_1' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:01:23') as col_c, 'val_2' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:01:23') as col_c, 'val_3' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:01:59') as col_c, 'val_1' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:01:59') as col_c, 'val_2' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:01:59') as col_c, 'val_3' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:02:36') as col_c, 'val_1' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:02:36') as col_c, 'val_2' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:02:36') as col_c, 'val_3' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:02:55') as col_c, 'val_1' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 04:02:55') as col_c, 'val_3' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 07:16:58') as col_c, 'val_1' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 07:16:58') as col_c, 'val_3' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 09:35:39') as col_c, 'val_1' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 09:35:39') as col_c, 'val_3' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 09:46:48') as col_c, 'val_1' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 09:46:48') as col_c, 'val_2' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 09:46:48') as col_c, 'val_3' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 10:47:48') as col_c, 'val_2' as col_d)
UNION ALL (select 'some_id' as col_a, 'foo' as col_b, timestamp('2017-01-25 10:47:48') as col_c, 'val_3' as col_d))
--TEMP
SELECT col_a, col_b, col_c,
  case when val_1_transposed is null then LAG(val_1_transposed) over (order by col_c) else val_1_transposed end as val_1_transposed,
  case when val_2_transposed is null then LAG(val_2_transposed) over (order by col_c) else val_2_transposed end as val_2_transposed,
  case when val_3_transposed is null then LAG(val_3_transposed) over (order by col_c) else val_3_transposed end as val_3_transposed,
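  -- within each grp only the row that opened the group has a non-NULL val_2_transposed, so MAX carries it forward
  MAX(val_2_transposed) OVER( PARTITION BY grp ORDER BY col_c ROWS UNBOUNDED PRECEDING ) as lag_ignored_nulls
FROM (
select *, 
  -- grp = col_c of the most recent row with a non-NULL val_2_transposed (col_a is constant in the sample, so it cannot be used for ordering)
  MAX(CASE WHEN val_2_transposed IS NOT NULL THEN col_c END ) OVER( ORDER BY col_c ROWS UNBOUNDED PRECEDING ) AS grp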
            from (
  SELECT col_a, col_b, col_c,
    MAX(IF(col_d = 'val_1', col_c, NULL)) AS val_1_transposed,
    MAX(IF(col_d = 'val_2', col_c, NULL)) AS val_2_transposed,
    MAX(IF(col_d = 'val_3', col_c, NULL)) AS val_3_transposed
  FROM (
    SELECT col_a, col_b, col_c, col_d FROM example) GROUP BY 1,2,3)) ORDER BY col_c DESC
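
A different technique from the grouping approach above may be simpler where it is available: BigQuery standard SQL accepts IGNORE NULLS inside the navigation function LAST_VALUE (as far as I know, not inside LAG itself). The sketch below assumes a hypothetical table t with columns ts and val; in the query above the same expression would be applied to val_2_transposed, ordered by col_c.

```sql
-- Hypothetical four-row table: ts orders the rows, val has gaps.
WITH t AS (
  SELECT 1 AS ts, 'a' AS val UNION ALL
  SELECT 2, CAST(NULL AS STRING) UNION ALL
  SELECT 3, CAST(NULL AS STRING) UNION ALL
  SELECT 4, 'b'
)
SELECT ts, val,
  -- last non-NULL val among all rows up to and including the current one
  LAST_VALUE(val IGNORE NULLS) OVER (
    ORDER BY ts
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS val_filled
FROM t
ORDER BY ts
-- val_filled for ts 1..4: a, a, a, b
```

Because the frame ends at the current row, a row that already has a value keeps it, which matches the CASE WHEN ... IS NULL THEN LAG(...) ELSE ... END intent of the original query.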

answered Sep 25 '22 11:09 by Pentium10