
How can I select one row of data per hour, from a table of time stamps?

Excuse me if this is confusing, as I am not very familiar with PostgreSQL. I have a Postgres database with a table full of "sites". Each site reports about once an hour, and each report adds a row to this table, like so:

site |      tstamp
-----+--------------------
6000 | 2013-05-09 11:53:04
6444 | 2013-05-09 12:58:00
6444 | 2013-05-09 13:01:08
6000 | 2013-05-09 13:01:32
6000 | 2013-05-09 14:05:06
6444 | 2013-05-09 14:06:25
6444 | 2013-05-09 14:59:58
6000 | 2013-05-09 19:00:07

As you can see, the time stamps are almost never on-the-nose, and sometimes there will be 2 or more within only a few minutes/seconds of each other. Furthermore, some sites won't report for hours at a time (on occasion). I want to only select one entry per site, per hour (as close to each hour as I can get). How can I go about doing this in an efficient way? I also will need to extend this to other time frames (like one entry per site per day -- as close to midnight as possible).

Thank you for any and all suggestions.

BLuFeNiX asked May 09 '13 19:05


3 Answers

You could use DISTINCT ON:

select distinct on (site, date_trunc('hour', tstamp)) site, tstamp
from t
order by site, date_trunc('hour', tstamp), tstamp;

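Be careful with the ORDER BY if you care about which entry you get: DISTINCT ON keeps the first row of each group in ORDER BY order, and the ORDER BY has to start with the DISTINCT ON expressions. Whatever you sort by after those decides which row survives.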

Alternatively, you could use the row_number window function to mark the rows of interest and then peel off the first result in each group from a derived table:

select site, tstamp
from (
    select site, tstamp,
           row_number() over (partition by site, date_trunc('hour', tstamp) order by tstamp) as r
    from t
) as dt
where r = 1;

Again, you'd adjust the ORDER BY (here, the one inside the window definition) to select the specific row of interest for each hour.
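
For the per-day case mentioned in the question, the same pattern should carry over by changing the truncation unit. A minimal sketch, assuming the table t(site, tstamp) used above and grouping on site as well, since the question asks for one row per site:

```sql
-- One row per site per day: the earliest report on each calendar day.
-- Assumes table t(site, tstamp) as in the queries above.
select distinct on (site, date_trunc('day', tstamp)) site, tstamp
from t
order by site, date_trunc('day', tstamp), tstamp;
```

Ordering by tstamp keeps the first report after midnight for each site and day. That only counts as "closest to midnight" if reports shortly before midnight can be ignored.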

mu is too short answered Sep 24 '22 06:09


You are looking for the closest value per hour. Some are before the hour and some are after. That makes this a hardish problem.

First, we need to identify the range of values that work for a particular hour. For this, I'll consider anything from 15 minutes before the hour to 45 minutes after as being for that hour. So, the period of consideration for 2:00 goes from 1:45 to 2:45 (arbitrary, but seems reasonable for your data). We can do this by shifting the time stamps by 15 minutes.

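Second, we need to get the closest value to the hour. So, we prefer 1:57 to 2:05. We can do this by scoring each row with the smaller of its minute value and 60 minus its minute value, then taking the row with the lowest score: 1:57 scores least(57, 60 - 57) = 3, while 2:05 scores least(5, 60 - 5) = 5.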

We can put these rules into a SQL statement, using row_number():

select site, tstamp, usedTimestamp
from (select site, tstamp,
             date_trunc('hour', tstamp + time '00:15') as usedTimestamp,
             row_number() over (partition by site, to_char(tstamp + time '00:15', 'YYYY-MM-DD-HH24')
                                order by least(extract(minute from tstamp), 60 - extract(minute from tstamp))
                               ) as seqnum
      from t
     ) as dt
where seqnum = 1;
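
One caveat with scoring by minutes alone: with the 15-minute shift, a report at 2:40 lands in the 2:00 bucket, but least(40, 60 - 40) scores it as 20 minutes away, so it would be picked ahead of a report at 2:25 (scored 25). A possible variant, offered as a sketch rather than a drop-in replacement, ranks by the actual distance to the bucket's hour instead. The names bucketed, ranked and target_hour are made up for this example; the table t(site, tstamp) is the one from the question.

```sql
-- Assign each row to an hour (15 minutes before to 45 minutes after),
-- then keep the row closest to that hour for each site.
with bucketed as (
    select site, tstamp,
           date_trunc('hour', tstamp + interval '15 minutes') as target_hour
    from t
)
select site, tstamp, target_hour
from (
    select site, tstamp, target_hour,
           row_number() over (
               partition by site, target_hour
               -- distance in seconds to the target hour; ties go to the earlier row
               order by abs(extract(epoch from (tstamp - target_hour))), tstamp
           ) as seqnum
    from bucketed
) as ranked
where seqnum = 1;
```

Changing the interval shifts the window. For example, interval '30 minutes' would assign each report to its nearest hour.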

Gordon Linoff answered Sep 22 '22 06:09


Regarding the extensibility aspect of your question:

"I also will need to extend this to other time frames (like one entry per site per day -- as close to midnight as possible)."

From the distinct set of site ids, and using a (recursive) CTE, I would build a set containing one entry per site per hour (or other specified interval), covering a specified range from StartDateTime to EndDateTime.

          SITE  DATE-TIME-HOUR
          6000  12.1.2013 00:00:00
          6000  12.1.2013 01:00:00
          .
          .
          .
          6000  12.1.2013 23:00:00
          7000  12.1.2013 00:00:00
          7000  12.1.2013 01:00:00
          .
          .
          .
          7000  12.1.2013 23:00:00

Then I would left join that CTE against your SITES log on site id, picking for each CTE row the log row with the minimum absolute difference between the CTE point-in-time and the log's point-in-time.

That way you are assured of a row for each site per interval.

P.S. For a site that has not phoned home for a long time, its most recent phone-in timestamp will be repeated multiple times as the closest one available.
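
A rough sketch of this idea in PostgreSQL, using generate_series in place of a recursive CTE to build the grid and a LATERAL subquery (PostgreSQL 9.3 or later) for the nearest-row lookup. The date range and the names g, slot and nearest are placeholders chosen for this example; the table t(site, tstamp) is the one from the question.

```sql
-- Build one slot per site per hour, then attach the report closest to each slot.
select g.site, g.slot, nearest.tstamp
from (
    select s.site, slot
    from (select distinct site from t) as s
    cross join generate_series(timestamp '2013-05-09 00:00',   -- hypothetical StartDateTime
                               timestamp '2013-05-09 23:00',   -- hypothetical EndDateTime
                               interval '1 hour') as slot
) as g
left join lateral (
    select t.tstamp
    from t
    where t.site = g.site
    -- smallest absolute distance, in seconds, between report and slot
    order by abs(extract(epoch from (t.tstamp - g.slot)))
    limit 1
) as nearest on true
order by g.site, g.slot;
```

Switching the step to interval '1 day' gives one slot per site per day at midnight. As the P.S. notes, a stale timestamp repeats across slots; adding an upper bound on the distance inside the lateral subquery would leave those slots null instead.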

Tim answered Sep 26 '22 06:09