
LAG functions and NULLS

How can I tell the LAG function to get the last "not null" value?

For example, see my table below, where I have a few NULL values in columns B and C. I'd like to fill the NULLs with the last non-null value. I tried to do that with the LAG function, like so:

case when B is null then lag (B) over (order by idx) else B end as B,

but that doesn't quite work when I have two or more NULLs in a row (see the NULL value in column C, row 3: I'd like it to be 0.50, as in the original).
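To make the failure concrete, here is a minimal sketch that replays the attempt above. It assumes an in-memory SQLite database (3.25 or later, for window function support) as a stand-in for the real server; the table name `t` and the sample values are hypothetical.

```python
import sqlite3

# Hypothetical sample data: idx is the ordering column, B turns NULL at the end.
rows = [(1, 0.10), (2, 0.50), (3, None), (4, None)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (idx INTEGER PRIMARY KEY, B REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# The attempt from the question: fall back to the previous row's B when B is NULL.
query = """
    SELECT idx,
           CASE WHEN B IS NULL THEN LAG(B) OVER (ORDER BY idx) ELSE B END AS B
    FROM t
    ORDER BY idx
"""
filled = conn.execute(query).fetchall()
print(filled)
```

Row 3 is filled from row 2, but row 4 stays NULL: LAG reads the stored value of row 3 (which is NULL), not the value the CASE expression produced for it.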

Any idea how I can achieve that? (It doesn't have to use the LAG function; any other ideas are welcome.)

A few assumptions:

  • The number of rows is dynamic;
  • The first value will always be non-null;
  • Once I have a NULL, it is NULL all the way to the end, so I want to fill it with the latest value.

Thanks


Diego asked Apr 25 '16 10:04




2 Answers

You can do it with the OUTER APPLY operator:

select t.id,
       t1.colA,
       t2.colB,
       t3.colC 
from table t
outer apply(select top 1 colA from table where id <= t.id and colA is not null order by id desc) t1
outer apply(select top 1 colB from table where id <= t.id and colB is not null order by id desc) t2
outer apply(select top 1 colC from table where id <= t.id and colC is not null order by id desc) t3;

This will work regardless of the number of NULLs or NULL "islands". You may have values, then NULLs, then values again, then NULLs again, and it will still work.
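The "islands" behaviour can be checked with a small sketch. SQLite has no OUTER APPLY, so this swaps in a correlated scalar subquery with LIMIT 1, which plays the same role as the `top 1 ... where id <= t.id` lookup; the table name `readings`, the aliases, and the data are assumptions made for illustration.

```python
import sqlite3

# Hypothetical data with two separate runs ("islands") of NULLs in colB.
rows = [(1, 0.2), (2, None), (3, None), (4, 0.7), (5, None)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, colB REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)

# For each row, look up the most recent non-null colB at or before that row.
# This stands in for: outer apply (select top 1 colB ... order by id desc).
query = """
    SELECT cur.id,
           (SELECT prev.colB
            FROM readings AS prev
            WHERE prev.id <= cur.id AND prev.colB IS NOT NULL
            ORDER BY prev.id DESC
            LIMIT 1) AS colB
    FROM readings AS cur
    ORDER BY cur.id
"""
filled = conn.execute(query).fetchall()
print(filled)
```

Each NULL picks up the nearest earlier value (0.2 for ids 2 and 3, 0.7 for id 5), so the approach does not depend on the NULLs all sitting at the end.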


If, however, the assumption (in your question) holds:

Once I have a NULL, it is NULL all the way to the end, so I want to fill it with the latest value.

there is a more efficient solution: we only need the latest non-null value of each column (when ordered by idx). Modify the above query by removing the id <= t.id condition from the subqueries and wrapping each column in coalesce:

select t.id,
       colA = coalesce(t.colA, t1.colA),
       colB = coalesce(t.colB, t2.colB),
       colC = coalesce(t.colC, t3.colC) 
from table t
outer apply (select top 1 colA from table 
             where colA is not null order by id desc) t1
outer apply (select top 1 colB from table 
             where colB is not null order by id desc) t2
outer apply (select top 1 colC from table 
             where colC is not null order by id desc) t3;

Giorgi Nakeuri answered Oct 01 '22 09:10


You could change your ORDER BY to force the NULLs to come first in the ordering, but that may be expensive...

lag(B) over (order by CASE WHEN B IS NULL THEN -1 ELSE idx END)

Or use a sub-query to calculate the replacement value once. This is possibly less expensive on larger sets, but very clunky.
- Relies on all the NULLs coming at the end
- The LAG doesn't rely on that

COALESCE(
    B,
    (
        SELECT
            sorted_not_null.B
        FROM
        (
            SELECT
                table.B,
                ROW_NUMBER() OVER (ORDER BY table.idx DESC)   AS row_id
            FROM
                table
            WHERE
                table.B IS NOT NULL
        )
           sorted_not_null
        WHERE
           sorted_not_null.row_id = 1
    )
)

(This should be faster on larger data sets than LAG or OUTER APPLY with correlated sub-queries, simply because the value is calculated once. For tidiness, you could calculate and store the [last_known_value] for each column in variables, then just use COALESCE(A, @last_known_A), COALESCE(B, @last_known_B), etc.)
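The variable idea can be sketched roughly as follows, again using in-memory SQLite as a stand-in. Bound query parameters take the place of @last_known_A and @last_known_B; the `readings` table, the `last_known` helper, and the data are hypothetical, and the sketch relies on the question's assumption that all NULLs come at the end.

```python
import sqlite3

# Hypothetical data: once a column turns NULL it stays NULL to the end.
rows = [(1, 0.3, 1.0), (2, 0.5, None), (3, None, None), (4, None, None)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (idx INTEGER PRIMARY KEY, A REAL, B REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)

def last_known(conn, column):
    """Return the latest non-null value of one column, computed once."""
    # The column name is interpolated into the SQL text, so restrict it
    # to the known columns of this illustrative table.
    if column not in ("A", "B"):
        raise ValueError(f"unexpected column: {column}")
    row = conn.execute(
        f"SELECT {column} FROM readings "
        f"WHERE {column} IS NOT NULL ORDER BY idx DESC LIMIT 1"
    ).fetchone()
    return row[0] if row else None

last_a = last_known(conn, "A")
last_b = last_known(conn, "B")

# The bound parameters play the role of @last_known_A and @last_known_B.
filled = conn.execute(
    "SELECT idx, COALESCE(A, ?), COALESCE(B, ?) FROM readings ORDER BY idx",
    (last_a, last_b),
).fetchall()
print(filled)
```

Each replacement value is looked up once per column rather than once per row, which is the point of the suggestion above.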

MatBailie answered Oct 01 '22 08:10