Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fill in gaps in data, using a value proportional to the gap distance to data from the surrounding rows?

At some point soon I'll have to prepare a list of prices of items on days. The granularity is 1 day and on days when there are sales of an item, I'll average the prices to get that day's average. There will be days where no sales are made, and I'm suited that an adequate approximation can be used by pulling the previous and next occurrences of sales, and for each day between them having a price that transitions linearly from one to the other.

Imagine the raw data is:

Item   Date       Price
Bread  2000-01-01 10
Bread  2000-01-02 9.5
Bread  2000-01-04 9.1
Sugar  2000-01-01 100
Sugar  2000-01-11 150

I can get to here:

Item   Date       Price
Bread  2000-01-01 10
Bread  2000-01-02 9.5
Bread  2000-01-03 NULL
Bread  2000-01-04 9.1
Sugar  2000-01-01 100
Sugar  2000-01-02 NULL
Sugar  2000-01-03 NULL
Sugar  2000-01-04 NULL
Sugar  2000-01-05 NULL
Sugar  2000-01-06 NULL
Sugar  2000-01-07 NULL
Sugar  2000-01-08 NULL
Sugar  2000-01-09 NULL
Sugar  2000-01-10 NULL
Sugar  2000-01-11 150

Where I want to get to is:

Item   Date       Price
Bread  2000-01-01 10
Bread  2000-01-02 9.5
Bread  2000-01-03 9.3 --being 9.5 + ((9.1 - 9.5 / 2) * 1)
Bread  2000-01-04 9.1
Sugar  2000-01-01 100
Sugar  2000-01-02 105 --being 100 + (150 - 100 / 10) * 1)
Sugar  2000-01-03 110 --being 100 + (150 - 100 / 10) * 2)
Sugar  2000-01-04 115
Sugar  2000-01-05 120
Sugar  2000-01-06 125
Sugar  2000-01-07 130
Sugar  2000-01-08 135
Sugar  2000-01-09 140
Sugar  2000-01-10 145 --being 100 + (150 - 100 / 10) * 9)
Sugar  2000-01-11 150

What have I tried so far? Thinking only; I'm planning on doing something like:

  • Pull the raw data
  • Join to a numbers/calendar table to pad out to sparse data
  • LAST_VALUE() (or first?) OVER ROWS UNBOUNDED PRECEDING/FOLLOWING (with a nulls-last order clause) to get the first non-null preceding_date, following_date, preceding_price and following_price from the raw data
  • DATEDIFF the fake date and the preceding_date to get a number of days (this is effectively how far across the gap we are, gap_progress) and the gap distance (following_date - preceding_date)
  • get the next price, previous price and gap distance for the formula (preceding_price + ((next_price - preceding_price)/gap_distance) * gap_progress)

I am, however, wondering if there's a simpler way, because I've got millions of item-days and this doesn't feel like it'll be that efficient..

I find lots of examples of questions where the data from the last or next row is smeared verbatim to fill in the gaps, but I don't recall seeing this situation where some kind of transition is attempted. Perhaps this technique can be doubly applied, by having a smear that runs forwards, replicating the most recent value and alongside it a smear that runs backwards:

Item   Date       DateFwd    DateBak     PriceF PriceB
Bread  2000-01-01 2000-01-01 2000-01-01  10     10
Bread  2000-01-02 2000-01-02 2000-01-02  9.5    9.5
Bread  2000-01-03 2000-01-02 2000-01-04  9.5    9.1
Bread  2000-01-04 2000-01-04 2000-01-04  9.1    9.1
Sugar  2000-01-01 2000-01-01 2000-01-01  100    100
Sugar  2000-01-02 2000-01-01 2000-01-11  100    150
Sugar  2000-01-03 2000-01-01 2000-01-11  100    150
Sugar  2000-01-04 2000-01-01 2000-01-11  100    150
Sugar  2000-01-05 2000-01-01 2000-01-11  100    150
Sugar  2000-01-06 2000-01-01 2000-01-11  100    150
Sugar  2000-01-07 2000-01-01 2000-01-11  100    150
Sugar  2000-01-08 2000-01-01 2000-01-11  100    150
Sugar  2000-01-09 2000-01-01 2000-01-11  100    150
Sugar  2000-01-10 2000-01-01 2000-01-11  100    150
Sugar  2000-01-11 2000-01-11 2000-01-11  150    150

These might provide the necessary data for the formula (preceding_price + ((next_price - preceding_price)/gap_distance) * gap_progress):

  • gap_distance = DATEDIFF(day, DateFwd, DateBak)
  • gap_progress = DATEDIFF(day, Date, DateFwd)
  • next_price = PriceB
  • preceding_price = PriceF

?

Here's a DDL of the data that I know I can get to (raw data joined with calendar table)

CREATE TABLE Data
([I] varchar(5), [D] date, [P] DECIMAL(10,5))
;

INSERT Data
([I], [D], [P])
VALUES
('Bread', '2000-01-01', 10),
('Bread', '2000-01-02', 9.5),
('Bread', '2000-01-04', 9.1),
('Sugar', '2000-01-01', 100),
('Sugar', '2000-01-11', 150);

CREATE TABLE Cal([D] DATE);
INSERT Cal VALUES
('2000-01-01'),
('2000-01-02'),
('2000-01-03'),
('2000-01-04'),
('2000-01-05'),
('2000-01-06'),
('2000-01-07'),
('2000-01-08'),
('2000-01-09'),
('2000-01-10'),
('2000-01-11');

SELECT d.i as [item], c.d as [date], d.p as [price] FROM
cal c LEFT JOIN data d ON c.d = d.d
like image 700
Caius Jard Avatar asked Sep 18 '18 13:09

Caius Jard


3 Answers

You can use OUTER APPLY to get the previous and next row with a price that is not null:

select
  d.item,
  d.date,
  case when d.price is null then
    prev.price + ( (next.price - prev.price) /
                   datediff(day, prev.date, next.date) *
                   datediff(day, prev.date, d.date)
                 )
  else
    d.price
  end as price
from data d
outer apply
(
    select top(1) *
    from data d2
    where d2.item = d.item and d2.date < d.date and d2.price is not null
    order by d2.date desc
) prev
outer apply
(
    select top(1) *
    from data d2
    where d2.item = d.item and d2.date > d.date and d2.price is not null
    order by d2.date
) next;

Rextester demo: http://rextester.com/QBL7472

UPDATE: This is probably slow. Perhaps it helps to add and d.price is null to the where clauses in the subqueries to show the DBMS that it doesn't have to actually look for other records when the price is not null. Just check the explain plans to see if that helps.

like image 104
Thorsten Kettner Avatar answered Nov 10 '22 08:11

Thorsten Kettner


it is easier to generate those missing gap together with the Price in one go

So i start off with your original raw data

CREATE TABLE t
    ([I] varchar(5), [D] date, [P] DECIMAL(10,2))
;

INSERT INTO t
    ([I], [D], [P])
VALUES
    ('Bread', '2000-01-01 00:00:00', '10'),
    ('Bread', '2000-01-02 00:00:00', '9.5'),
    ('Bread', '2000-01-04 00:00:00', '9.1'),
    ('Sugar', '2000-01-01 00:00:00', '100'),
    ('Sugar', '2000-01-11 00:00:00', '150');

; with
-- number is a tally table. here i use recursive cte to generate 100 numbers
number as
(
    select  n = 0
    union all
    select  n = n + 1
    from    number
    where   n < 99
),
-- a cte to get the Price of next date and also day diff
cte as
(
    select  *, 
            nextP = lead(P) over(partition by I order by D),
            cnt = datediff(day, D, lead(D) over(partition by I order by D)) - 1
    from    t
) 
select  I, 
        D = dateadd(day, n, D), 
        P = coalesce(c.P + (c.nextP - c.P) / ( cnt + 1) * n, c.P)
from    cte c
        cross join number n
where   n.n <= isnull(c.cnt, 0)

drop table t
like image 22
Squirrel Avatar answered Nov 10 '22 10:11

Squirrel


I would put your formula 100 + (150 - 100 / 10) * 9) etc into a scalar UDF and use it in a persisted computed column.

like image 1
Tab Alleman Avatar answered Nov 10 '22 09:11

Tab Alleman