I have a table with time-series data that's mostly nulls, and I want to fill in all of the nulls with the last known value.
I have a few solutions, but they're much slower than doing the equivalent DataFrame.fillna(method='ffill')
operation in Pandas.
A simplified version of the code / data that I'm using:
select d.[date], d.[price],
(select top 1 p.price from price_table p
where p.price is not null and p.[date] <= d.[date]
order by p.[date] desc) as ff_price
from price_table d
To produce the table
date price ff_price
---------- ----- --------
2016-07-11 0.79 0.79
2016-07-12 NULL 0.79
2016-07-13 NULL 0.79
2016-07-14 0.69 0.69
2016-07-15 NULL 0.69
...
2016-09-21 0.88 0.88
...
I have >100 million rows, so this takes quite a while.
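For reference, a minimal setup that reproduces the sample data above (the DATE and DECIMAL(5,2) types are an assumption):
CREATE TABLE price_table ([date] date PRIMARY KEY, price decimal(5,2) NULL);

INSERT INTO price_table ([date], price) VALUES
('2016-07-11', 0.79),
('2016-07-12', NULL),
('2016-07-13', NULL),
('2016-07-14', 0.69),
('2016-07-15', NULL),
('2016-09-21', 0.88);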
This looks like a classic "gaps and islands" question. Assuming you're not on SQL Server 2008 or earlier (versions which are almost entirely out of support), this should get you the result you're after:
WITH CTE AS(
SELECT [date],
price,
COUNT(CASE WHEN price IS NOT NULL THEN 1 END) OVER (ORDER BY [date]
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Grp
FROM price_table p)
SELECT [date],
price,
MIN(price) OVER (PARTITION BY grp) AS ff_price
FROM CTE;
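To see why this works, run just the CTE body and look at Grp: every non-null price starts a new group, so each group contains exactly one non-null value, which is what lets MIN (or MAX) pick it up. For the sample data above it should produce:
SELECT [date],
       price,
       COUNT(CASE WHEN price IS NOT NULL THEN 1 END) OVER (ORDER BY [date]
             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Grp
FROM price_table;
-- 2016-07-11  0.79  1
-- 2016-07-12  NULL  1
-- 2016-07-13  NULL  1
-- 2016-07-14  0.69  2
-- 2016-07-15  NULL  2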
Assuming that your date column is DATE and your price column is DECIMAL(5,2), please test this approach:
SELECT
P.[date],
P.[price],
ff_price = CONVERT(
DECIMAL(5,2), -- Original price datatype
SUBSTRING(
MAX(
CAST(P.[date] AS BINARY(3)) + -- 3: datalength of P.[date] column
CAST(P.[price] AS BINARY(5)) -- 5: datalength of P.[price] column
) OVER (ORDER BY P.[date] ROWS UNBOUNDED PRECEDING),
4, -- Start position: the first byte after the 3-byte binary date
5)) -- Length: the 5 bytes that compose the original price datatype
FROM
price_table AS P
This is a solution I implemented for a similar problem, and you can find the exhaustive explanation here. The reason this approach is good is that it doesn't require an explicit sort, as long as you have an index on date.
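A supporting index might look like this (the index name is hypothetical; including price makes it covering, so the window aggregate can scan the rows already in date order):
CREATE INDEX IX_price_table_date ON price_table ([date]) INCLUDE (price);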
What it does is basically use a windowed MAX over the concatenation of the 3 bytes that compose your date column (this is why I mentioned that your column must be DATE; a DATETIME needs 8 bytes, though you can edit the query to work with that) and the bytes that compose your price column (5 bytes, as assumed above). This is the CAST(P.[date] AS BINARY(3)) + CAST(P.[price] AS BINARY(5)) part.
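You can inspect that byte layout directly (a minimal sketch; the literals come from the sample data):
SELECT
    CAST(CAST('2016-07-11' AS date) AS binary(3)) AS date_bytes,   -- 3 bytes
    CAST(CAST(0.79 AS decimal(5,2)) AS binary(5)) AS price_bytes,  -- 5 bytes
    CAST(CAST('2016-07-11' AS date) AS binary(3))
        + CAST(CAST(0.79 AS decimal(5,2)) AS binary(5)) AS concatenated; -- 8 bytes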
When you calculate this with ORDER BY P.[date] ROWS UNBOUNDED PRECEDING, the engine is basically computing a rolling max over values whose most significant bytes are your dates. The max result will always update when the date changes, but since concatenating any value with a NULL price also yields NULL (as binary), the MAX ignores those rows and retains the previous non-null max (within the P.[date] ROWS UNBOUNDED PRECEDING frame).
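The NULL propagation is easy to verify on its own (with the default CONCAT_NULL_YIELDS_NULL behavior):
SELECT CAST(CAST('2016-07-12' AS date) AS binary(3))
       + CAST(NULL AS binary(5)) AS concatenated; -- returns NULL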
This is the binary result of the windowed MAX (I added a leading record with a NULL price so you can see that the result is NULL while there is no preceding non-null price):
date       price ff_price WindowedMax
---------- ----- -------- ------------------
2016-07-10 NULL  NULL     NULL
2016-07-11 0.79  0.79     0x9B3B0B050200014F
2016-07-12 NULL  0.79     0x9B3B0B050200014F
2016-07-13 NULL  0.79     0x9B3B0B050200014F
2016-07-14 0.69  0.69     0x9E3B0B0502000145
2016-07-15 NULL  0.69     0x9E3B0B0502000145
2016-07-21 0.88  0.88     0xA53B0B0502000158
2016-07-22 NULL  0.88     0xA53B0B0502000158