
Efficient time series querying in Postgres

Tags:

sql

postgresql

I have a table in my PG db that looks somewhat like this:

id | widget_id | for_date | score |

Each referenced widget has a lot of these items. It's always 1 per day per widget, but there are gaps.

What I want is a result set that contains every widget for each date since X. The dates come from generate_series():

 SELECT date.date::date
   FROM generate_series('2012-01-01'::timestamp with time zone,'now'::text::date::timestamp with time zone, '1 day') date(date)
 ORDER BY date.date DESC;

If a given widget_id has no entry for a date, I want to use its most recent earlier entry. Say widget 1337 has no entry on 2012-05-10 and its latest earlier entry is from 2012-05-08; then I want the result set to show the 2012-05-08 score for 2012-05-10 as well:

Actual data:
widget_id | for_date   | score
1336      | 2012-05-07 | 20
1337      | 2012-05-07 | 12
1337      | 2012-05-08 | 41
1337      | 2012-05-11 | 500

Desired output based on generate series:
widget_id | for_date   | score
1336      | 2012-05-07 | 20
1337      | 2012-05-07 | 12
1336      | 2012-05-08 | 20
1337      | 2012-05-08 | 41
1336      | 2012-05-09 | 20
1337      | 2012-05-09 | 41
1336      | 2012-05-10 | 20
1337      | 2012-05-10 | 41
1336      | 2012-05-11 | 20
1337      | 2012-05-11 | 500
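
To try the queries from the answers against this sample, a table along these lines should do. This is only a sketch: the table name score is borrowed from the answers, the column types are guesses, and the first widget is entered as 1336 to match the desired output.

```sql
-- Assumed schema: table name taken from the answers, column types are guesses
CREATE TABLE score (
   id        serial PRIMARY KEY
 , widget_id integer NOT NULL
 , for_date  date    NOT NULL
 , score     integer
 , UNIQUE (widget_id, for_date)   -- "always 1 per day per widget"
);

-- Sample rows from the "Actual data" listing
INSERT INTO score (widget_id, for_date, score) VALUES
   (1336, '2012-05-07', 20)
 , (1337, '2012-05-07', 12)
 , (1337, '2012-05-08', 41)
 , (1337, '2012-05-11', 500);
```

The UNIQUE constraint also creates an index on (widget_id, for_date), which the per-day lookups in the answers can probably use.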

Eventually I want to boil this down into a view so I have consistent data sets per day that I can query easily.

Edit: Made the sample data and expected resultset clearer

TheDeadSerious asked Feb 14 '13 12:02


2 Answers

SQL Fiddle

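select
    widget_id,
    for_date,
    -- a missing score takes the first value of its group, i.e. the latest known score
    case
        when score is not null then score
        else first_value(score) over (partition by widget_id, c order by for_date)
        end score
from (
    select
        a.widget_id,
        a.for_date,
        s.score,
        -- running count of non-null scores: it increases only on days with an entry,
        -- so each entry and the gap days after it share the same c
        count(score) over(partition by a.widget_id order by a.for_date) c
    from (
        -- every widget paired with every day between the first and last entry
        select widget_id, g.d::date for_date
        from (
            select distinct widget_id
            from score
            ) s
            cross join
            generate_series(
                (select min(for_date) from score),
                (select max(for_date) from score),
                '1 day'
            ) g(d)
        ) a
        left join
        score s on a.widget_id = s.widget_id and a.for_date = s.for_date
) s
order by widget_id, for_date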

Clodoaldo Neto answered Sep 21 '22 18:09


First of all, you can have a much simpler generate_series() table expression. This is equivalent to yours (except for the descending order, which contradicts the rest of your question anyway):

SELECT generate_series('2012-01-01'::date, now()::date, '1d')::date

The type date is coerced to timestamptz automatically on input. The return type is timestamptz either way. I use a subquery below, so I can cast the output to date right away.
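
To check the type claim on your own installation, something like the following should work. It is a minimal sketch using pg_typeof(), with an arbitrary three-day range:

```sql
-- pg_typeof() reports the data type of an expression
SELECT pg_typeof(g)       AS raw_type    -- type returned by generate_series()
     , pg_typeof(g::date) AS cast_type   -- type after the explicit cast
FROM   generate_series('2012-01-01'::date, '2012-01-03'::date, '1d') g
LIMIT  1;
```

Going by the explanation above, raw_type should come back as timestamp with time zone and cast_type as date.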

Next, max() used as a window function returns exactly what you need: the highest value since the start of the frame, ignoring NULL values. Building on that, you get a radically simple query.
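
To see the running max() in isolation, here is a small sketch on made-up inline rows (the VALUES list is hypothetical and mimics widget 1337 after the dates have been left-joined):

```sql
-- for_date is NULL on days without an entry
SELECT day
     , for_date
     , max(for_date) OVER (ORDER BY day) AS effective_date
FROM  (VALUES
         ('2012-05-07'::date, '2012-05-07'::date)
       , ('2012-05-08',       '2012-05-08')
       , ('2012-05-09',       NULL)
       , ('2012-05-10',       NULL)
       , ('2012-05-11',       '2012-05-11')
      ) t(day, for_date)
ORDER  BY day;
```

With the default frame (from the start of the partition up to the current row), effective_date stays at 2012-05-08 for the two gap days and moves to 2012-05-11 on the last row. This works because for_date, where present, equals day, so the highest date seen so far is always the most recent one.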

For a given widget_id

Most likely faster than involving CROSS JOIN or WITH RECURSIVE:

SELECT a.day, s.*
FROM  (
   SELECT d.day
         ,max(s.for_date) OVER (ORDER BY d.day) AS effective_date
   FROM  (
      SELECT generate_series('2012-01-01'::date, now()::date, '1d')::date
      ) d(day)
   LEFT   JOIN score s ON s.for_date = d.day
                      AND s.widget_id = 1337 -- "for a given widget_id"
   ) a
LEFT   JOIN score s ON s.for_date = a.effective_date
                   AND s.widget_id = 1337
ORDER  BY a.day;

->sqlfiddle

With this query you can put any column from score you like into the final SELECT list. I put s.* for simplicity. Pick your columns.

If you want to start your output with the first day that actually has a score, simply replace the last LEFT JOIN with JOIN.

Generic form for all widget_id's

Here I use a CROSS JOIN to produce a row for every widget on every date:

SELECT a.day, a.widget_id, s.score
FROM  (
   SELECT d.day, w.widget_id
         ,max(s.for_date) OVER (PARTITION BY w.widget_id
                                ORDER BY d.day) AS effective_date
   FROM  (SELECT generate_series('2012-05-05'::date
                                ,'2012-05-15'::date, '1d')::date AS day) d
   CROSS  JOIN (SELECT DISTINCT widget_id FROM score) AS w
   LEFT   JOIN score s ON s.for_date = d.day AND s.widget_id = w.widget_id
   ) a
JOIN  score s ON s.for_date = a.effective_date
             AND s.widget_id = a.widget_id  -- instead of LEFT JOIN
ORDER BY a.day, a.widget_id;

->sqlfiddle
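
Since the question ultimately asks for a view, the generic query could be wrapped roughly like this. It is a sketch under assumptions: the view name widget_daily_score is made up, the start date is the one from the question, and the series runs up to the current date.

```sql
CREATE VIEW widget_daily_score AS   -- hypothetical view name
SELECT a.day AS for_date, a.widget_id, s.score
FROM  (
   SELECT d.day, w.widget_id
        , max(s.for_date) OVER (PARTITION BY w.widget_id
                                ORDER BY d.day) AS effective_date
   FROM  (SELECT generate_series('2012-01-01'::date
                                , now()::date, '1d')::date AS day) d
   CROSS  JOIN (SELECT DISTINCT widget_id FROM score) AS w
   LEFT   JOIN score s ON s.for_date = d.day AND s.widget_id = w.widget_id
   ) a
JOIN   score s ON s.for_date = a.effective_date
              AND s.widget_id = a.widget_id;

-- Usage: sort in the outer query rather than inside the view
SELECT *
FROM   widget_daily_score
WHERE  for_date >= '2012-05-07'
ORDER  BY for_date, widget_id;
```

The ORDER BY is left out of the view on purpose, so each caller can choose its own sort order.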

Erwin Brandstetter answered Sep 19 '22 18:09