When formatting data for time-series analysis, a common need is to impute missing values by filling-forward values over time (also called Last-Observation-Carried-Forward / LOCF ).
While data-analysis environments typically provide that functionality (e.g. Pandas fillna()
), for larger data-sets it can be more efficient to compute it in SQL (e.g. utilizing a data-parallel data-warehouse appliance).
For example, consider:
| UNIT | TIME | VALUE |
|------|------|-------|
| 1 | 1 | NULL |
| 1 | 2 | .5 |
| 1 | 3 | NULL |
| 1 | 4 | NULL |
| 1 | 5 | .2 |
| 1 | 6 | NULL |
| 2 | 1 | .6 |
| 2 | 2 | NULL |
which, after filling-forward the VALUE column over TIME (independently for each UNIT) yields:
| UNIT | TIME | VALUE |
|------|------|-------|
| 1 | 1 | NULL |
| 1 | 2 | .5 |
| 1 | 3 | .5 |
| 1 | 4 | .5 |
| 1 | 5 | .2 |
| 1 | 6 | .2 |
| 2 | 1 | .6 |
| 2 | 2 | .6 |
(Notice the initial NULL for UNIT 1 can't be imputed as there is no prior value)
Time could also be a timestamp or datetime type column.
With some databases, e.g., Postgres, you can define your own aggregation functions. LOCF is just a running COALESCE.
CREATE OR REPLACE FUNCTION locf_state( FLOAT, FLOAT )
RETURNS FLOAT
LANGUAGE SQL
AS $f$
SELECT COALESCE($2,$1)
$f$;
CREATE AGGREGATE locf(FLOAT) (
SFUNC = locf_state,
STYPE = FLOAT
);
The query is then more readable:
SELECT unit, time,
locf(value) OVER( PARTITION BY unit ORDER BY time )
FROM mytable;
SQLFiddle: http://sqlfiddle.com/#!15/2c73b/1/0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With