Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Impute via fill-forward/LOCF a column over a range of sequential rows in SQL?

When formatting data for time-series analysis, a common need is to impute missing values by filling-forward values over time (also called Last-Observation-Carried-Forward / LOCF ).

While data-analysis environments typically provide that functionality (e.g. Pandas fillna() ), for larger data-sets it can be more efficient to compute it in SQL (e.g. utilizing a data-parallel data-warehouse appliance).

For example, consider:

    | UNIT | TIME | VALUE |
    |------|------|-------|
    | 1    | 1    | NULL  |
    | 1    | 2    | .5    |
    | 1    | 3    | NULL  |
    | 1    | 4    | NULL  |
    | 1    | 5    | .2    |
    | 1    | 6    | NULL  |
    | 2    | 1    | .6    |
    | 2    | 2    | NULL  |

which, after filling-forward the VALUE column over TIME (independently for each UNIT) yields:

    | UNIT | TIME | VALUE |
    |------|------|-------|
    | 1    | 1    | NULL  |
    | 1    | 2    | .5    |
    | 1    | 3    | .5    |
    | 1    | 4    | .5    |
    | 1    | 5    | .2    |
    | 1    | 6    | .2    |
    | 2    | 1    | .6    |
    | 2    | 2    | .6    |

(Notice the initial NULL for UNIT 1 can't be imputed as there is no prior value)

Time could also be a timestamp or datetime type column.

like image 229
DavidJ Avatar asked Apr 13 '15 18:04

DavidJ


1 Answers

With some databases, e.g., Postgres, you can define your own aggregation functions. LOCF is just a running COALESCE.

CREATE OR REPLACE FUNCTION locf_state( FLOAT, FLOAT )
RETURNS FLOAT
LANGUAGE SQL
AS $f$
  SELECT COALESCE($2,$1)
$f$;

CREATE AGGREGATE locf(FLOAT) (
  SFUNC = locf_state,
  STYPE = FLOAT
);

The query is then more readable:

SELECT unit, time, 
       locf(value) OVER( PARTITION BY unit ORDER BY time )
FROM   mytable;

SQLFiddle: http://sqlfiddle.com/#!15/2c73b/1/0

like image 140
Vincent Zoonekynd Avatar answered Sep 28 '22 03:09

Vincent Zoonekynd