Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient way to forward-fill nulls in time-series data using T-SQL

I have a table with time-series data that's mostly nulls, and I want to fill in all of the nulls with the last known value.

I have a few solutions, but they're much slower than doing the equivalent DataFrame.fillna(method='ffill') operation in Pandas.

A simplified version of the code / data that I'm using:

select d.[date], d.[price],
       (select top 1 p.price from price_table p
        where p.price is not null and p.[date] <= p.[date]
        order by p.[date] desc) as ff_price
from price_table d

To produce the table

date       price ff_price
---------- ----- --------
2016-07-11 0.79  0.79
2016-07-12 NULL  0.79
2016-07-13 NULL  0.79
2016-07-14 0.69  0.69
2016-07-15 NULL  0.69
...
2016-09-21 0.88  0.88
...

I have >100 million rows, so this takes quite a while.

like image 734
user448830 Avatar asked Jan 27 '23 16:01

user448830


2 Answers

This looks like a "classic" gaps and island question. Assuming you're not using 2008 or prior (which are all (almost) entirely out of support) this should get you the result you're after:

WITH CTE AS(
    SELECT [date],
           price,
           COUNT(CASE WHEN price IS NOT NULL THEN 1 END) OVER (ORDER BY [date]
                                                               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Grp
    FROM price_table p)
SELECT [date],
       price,
       MIN(price) OVER (PARTITION BY grp) AS ff_price
FROM CTE;

db<>fiddle

like image 112
Larnu Avatar answered Feb 17 '23 12:02

Larnu


Assuming that your column is DATE and price is DECIMAL(5,2), please test this approach:

SELECT
    P.[date],
    P.[price],
    ff_price = CONVERT(
        DECIMAL(5,2),       -- Original price datatype
        SUBSTRING(
            MAX(
                CAST(P.[date] AS BINARY(3)) +   -- 3: datalength of P.[date] column
                CAST(P.[price] AS BINARY(5))    -- 5: datalength of P.[price] column
            ) OVER (ORDER BY P.[date] ROWS UNBOUNDED PRECEDING),

            4,  -- Position to start that's not the binary part of the date

            5))-- Characters that compose the binary of the original price datatype
FROM
    price_table  AS P

This is a solution I implemented with a similar problem and you can find the exaustive explanation here. The reason this approach is good is because it doesn't require a explicit sort, as long as you have an index by date.

What it does is basically use a windowed MAX with the concatenation of the 3 bytes that composes your date column (this is why I mentioned that you column must be DATE, otherwise DATETIME will need 8 bytes, you can edit the query to work with this) with the bytes that compose your price column (which are 5 bytes, also assumed). This is the CAST(P.[date] AS BINARY(3)) + CAST(P.[price] AS BINARY(5)) part.

When you calculate this and ORDER BY P.[date] ROWS UNBOUNDED PRECEDING, the engine is basically doing rolling max with values which most significant bytes are your dates. The max result will always update when the date changes, but considering that concatenating any value with NULL as price will also yield NULL (as binary), then the MAX will always ignore this value and retain the previous non-null MAX (by P.[date] ROWS UNBOUNDED PRECEDING).

This is the binary result of the windowed MAX (I added a previous record with NULL so you see that result is NULL for null prices values):

date        price   ff_price    WindowedMax
2016-07-10  NULL    NULL        NULL
2016-07-11  0.79    0.79        0x9B3B0B050200014F
2016-07-12  NULL    0.79        0x9B3B0B050200014F
2016-07-13  NULL    0.79        0x9B3B0B050200014F
2016-07-14  0.69    0.69        0x9E3B0B0502000145
2016-07-15  NULL    0.69        0x9E3B0B0502000145
2016-07-21  0.88    0.88        0xA53B0B0502000158
2016-07-22  NULL    0.88        0xA53B0B0502000158
like image 30
EzLo Avatar answered Feb 17 '23 12:02

EzLo