How can I tell the LAG function to get the last "not null" value?
For example, see my table bellow where I have a few NULL values on column B and C. I'd like to fill the nulls with the last non-null value. I tried to do that by using the LAG function, like so:
case when B is null then lag (B) over (order by idx) else B end as B,
but that doesn't quite work when I have two or more nulls in a row (see the NULL value on column C row 3 - I'd like it to be 0.50 as the original).
Any idea how can I achieve that? (it doesn't have to be using the LAG function, any other ideas are welcome)
A few assumptions:
Thanks
The LAG function is used to access data from PREVIOUS rows along with data from the current row. An ORDER BY clause is required when working with LEAD and LAG functions, but a PARTITION BY clause is optional.
LAG provides access to a row at a given physical offset that comes before the current row. Use this analytic function in a SELECT statement to compare values in the current row with values in a previous row.
lag: Lag a Time SeriesCompute a lagged version of a time series, shifting the time base back by a given number of observations. lag is a generic function; this page documents its default method.
The Syntax of the LEAD Function Just like LAG() , the LEAD() function takes three arguments: the name of a column or an expression, the offset to be skipped below, and the default value to be returned if the stored value obtained from the row below is empty. Only the first argument is required.
You can do it with outer apply
operator:
select t.id,
t1.colA,
t2.colB,
t3.colC
from table t
outer apply(select top 1 colA from table where id <= t.id and colA is not null order by id desc) t1
outer apply(select top 1 colB from table where id <= t.id and colB is not null order by id desc) t2
outer apply(select top 1 colC from table where id <= t.id and colC is not null order by id desc) t3;
This will work, regardless of the number of nulls or null "islands". You may have values, then nulls, then again values, again nulls. It will still work.
If, however the assumption (in your question) holds:
Once I have a
NULL
, isNULL
all up to the end - so I want to fill it with the latest value.
there is a more efficient solution. We only need to find the latest (when ordered by idx
) values. Modifying the above query, removing the where id <= t.id
from the subqueries:
select t.id,
colA = coalesce(t.colA, t1.colA),
colB = coalesce(t.colB, t2.colB),
colC = coalesce(t.colC, t3.colC)
from table t
outer apply (select top 1 colA from table
where colA is not null order by id desc) t1
outer apply (select top 1 colB from table
where colB is not null order by id desc) t2
outer apply (select top 1 colC from table
where colC is not null order by id desc) t3;
You could make a change to your ORDER BY
, to force the NULLs to be first in your ordering, but that may be expensive...
lag(B) over (order by CASE WHEN B IS NULL THEN -1 ELSE idx END)
Or, use a sub-query to calculate the replacement value once. Possibly less expensive on larger sets, but very clunky.
- Relies on all the NULLs coming at the end
- The LAG doesn't rely on that
COALESCE(
B,
(
SELECT
sorted_not_null.B
FROM
(
SELECT
table.B,
ROW_NUMBER() OVER (ORDER BY table.idx DESC) AS row_id
FROM
table
WHERE
table.B IS NOT NULL
)
sorted_not_null
WHERE
sorted_not_null.row_id = 1
)
)
(This should be faster on larger data-sets, than LAG
or using OUTER APPLY
with correlated sub-queries, simply because the value is calculated once. For tidiness, you could calculate and store the [last_known_value] for each column in variables, then just use COALESCE(A, @last_known_A), COALESCE(B, @last_known_B), etc
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With