I have a dataset containing several weeks of observations at a 2-minute frequency, and I want to increase the time interval from 2 minutes to 5 minutes. The problem is that the frequency of the observations is not always consistent: theoretically there should be 5 observations every 10 minutes, but usually that is not the case. How can I aggregate the observations with the average function, based on the date and time of each observation? In other words, I want to aggregate over every 5-minute interval, even though the number of observations in each interval varies. The date and time are stored in timestamp format.
Example Data:
1 2007-09-14 22:56:12 5.39
2 2007-09-14 22:58:12 5.34
3 2007-09-14 23:00:12 5.16
4 2007-09-14 23:02:12 5.54
5 2007-09-14 23:04:12 5.30
6 2007-09-14 23:06:12 5.20
Expected results:
1 2007-09-14 23:00 5.29
2 2007-09-14 23:05 5.34
You can use the date and time data types with the MIN(), MAX(), and COUNT() functions, with the DISTINCT keyword inside those functions, and in the GROUP BY clause of a SELECT statement. An attempt to apply SUM() or AVG() directly to date or time data types returns an error.
Use the keyword COUNT to count the number of rows in a column or table. Use the keyword AVG to find the mean of a numeric column. Use the keyword SUM to find the total of a numeric column when all the values are added together. Use the keyword GROUP BY to group the rows of a table by the values of a column.
An SQL aggregate function performs a calculation on a set of values and returns a single value. For example, the average function (AVG) takes a list of values and returns their average. Because an aggregate function operates on a set of values, it is often used with the GROUP BY clause of the SELECT statement.
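For example, a minimal sketch (the table and column names readings, sensor_id, and value are invented for illustration, not from the original question):
SELECT sensor_id, AVG(value) AS mean_value, COUNT(*) AS n_rows
FROM readings
GROUP BY sensor_id;
This returns one row per sensor_id, with the mean and the row count for that group.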
The answers to this question likely provide good solutions to your problem, showing ways to efficiently aggregate data into time windows.
Essentially, use the avg aggregate with:
GROUP BY floor(extract(epoch from the_timestamp) / 60 / 5)
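A fuller sketch of that approach (my example, assuming PostgreSQL and a table observation with columns the_timestamp and the_value; those names are not from the original answer):
SELECT
to_timestamp(floor(extract(epoch from the_timestamp) / 60 / 5) * 60 * 5) as interval_start,
avg(the_value) as avg_value
FROM observation
GROUP BY floor(extract(epoch from the_timestamp) / 60 / 5)
ORDER BY interval_start;
Dividing the epoch by 60 and then 5 groups rows into 300-second (5-minute) buckets; to_timestamp converts the bucket number back into a timestamp for display.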
Ok, so this is just one way to handle this. I hope this gets you thinking about how to convert the data for your analysis needs.
There's a prerequisite to test out this code: you need a table with all possible 1-minute timestamps. There are many ways to go about this; I'll just use what I have available, which is a table dim_time that has each minute (00:01:00) through (23:59:00), and another table with all possible dates (dim_date). When you join them (on 1=1) you get all possible minutes for all possible days.
--First you need to create some functions I'll use later.
--Credit for this first function goes to David Walling.
--It floors $1 down to the start of its $2-minute bucket: convert to epoch seconds,
--floor to a multiple of $2*60 seconds, then convert back to a timestamp.
--Note: the hard-coded 'CST6CDT' zone assumes US Central time; adjust for your zone.
CREATE OR REPLACE FUNCTION dev.beginning_datetime_floor(timestamp without time zone, integer)
RETURNS timestamp without time zone AS
$BODY$
SELECT
date_trunc('minute',timestamp with time zone 'epoch' +
floor(extract(epoch from $1)/($2*60))*$2*60
* interval '1 second') at time zone 'CST6CDT'
$BODY$
LANGUAGE sql VOLATILE;
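--Example usage (my illustration, not from the original post):
--SELECT dev.beginning_datetime_floor('2007-09-14 22:56:12', 5);
--floors the input to the start of its 5-minute bucket; the result is rendered
--in the CST6CDT zone, so the wall-clock time is shifted unless your inputs are UTC.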
--The following function is what I described in my previous post:
--round $1 to the nearest $2-minute mark (seconds are ignored; only the minute part is rounded).
CREATE OR REPLACE FUNCTION dev.round_minutes(timestamp without time zone, integer)
RETURNS timestamp without time zone AS
$BODY$
SELECT date_trunc('hour', $1) + cast(($2::varchar||' min') as interval) * round(date_part('minute',$1)::float / cast($2 as float))
$BODY$
LANGUAGE sql VOLATILE;
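--Example usage (my illustration, not from the original post):
--SELECT dev.round_minutes('2007-09-14 22:56:12'::timestamp, 1);  --returns 2007-09-14 22:56:00
--SELECT dev.round_minutes('2007-09-14 22:56:12'::timestamp, 5);  --returns 2007-09-14 22:55:00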
--Let's load the data into a temp table; I added some data points. Note: I got rid of the partial seconds.
SELECT cast(timestamp_original as timestamp) as timestamp_original, datapoint INTO TEMPORARY TABLE timestamps_second2
FROM
(
SELECT '2007-09-14 22:56:12' as timestamp_original, 0 as datapoint
UNION
SELECT '2007-09-14 22:58:12' as timestamp_original, 1 as datapoint
UNION
SELECT '2007-09-14 23:00:12' as timestamp_original, 10 as datapoint
UNION
SELECT '2007-09-14 23:02:12' as timestamp_original, 100 as datapoint
UNION
SELECT '2007-09-14 23:04:12' as timestamp_original, 1000 as datapoint
UNION
SELECT '2007-09-14 23:06:12' as timestamp_original, 10000 as datapoint
) as data;
--This is the bit of code you'll have to replace with your implementation of getting all possible minutes.
--You could make a sequence of timestamps in R, or simply make the timestamps in Excel to test out the rest
--of the code (or use generate_series; see the sketch after this query).
--The result of the query is simply '2007-09-14 00:00:00' through '2007-09-14 23:59:00'.
SELECT * INTO TEMPORARY TABLE possible_timestamps
FROM
(
select the_date + beginning_minute as minute_timestamp
FROM datawarehouse.dim_date as dim_date
JOIN datawarehouse.dim_time as dim_time
ON 1=1
where dim_date.the_date = '2007-09-14'
group by the_date, beginning_minute
order by the_date, beginning_minute
) as data;
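--Alternative (my addition, assuming PostgreSQL 8.4+; not part of the original answer):
--generate_series can build the same set of minutes without dim_date/dim_time.
--Uncomment to use in place of the query above:
--SELECT * INTO TEMPORARY TABLE possible_timestamps
--FROM
--(
--SELECT generate_series('2007-09-14 00:00'::timestamp,
--                       '2007-09-14 23:59'::timestamp,
--                       '1 minute'::interval) as minute_timestamp
--) as data;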
--Round to the nearest minute (be sure to think about how this might change your results).
SELECT * INTO TEMPORARY TABLE rounded_timestamps2
FROM
(
SELECT dev.round_minutes(timestamp_original,1) as minute_timestamp_rounded, datapoint
from timestamps_second2
) as data;
--Let's join the minutes we have data for against the possible minutes.
--I used some subqueries so when you select all from the table you'll see the important part (not needed).
SELECT * INTO TEMPORARY TABLE joined_with_possibles
FROM
(
SELECT *
FROM
(
SELECT *, (MIN(minute_timestamp_rounded) OVER ()) as min_time, (MAX(minute_timestamp_rounded) OVER ()) as max_time
FROM possible_timestamps as t1
LEFT JOIN rounded_timestamps2 as t2
ON t1.minute_timestamp = t2.minute_timestamp_rounded
ORDER BY t1.minute_timestamp asc
) as inner_query
WHERE minute_timestamp >= min_time
AND minute_timestamp <= max_time
) as data;
--Here's the tricky part that might not suit your needs, but it's one method:
--if a value is missing, it grabs the previous value;
--if the prior value is also missing, it grabs the one before that; otherwise it's null.
--Best practice would be to run another case statement with 0, 1, or 2 specifying which point
--was pulled, so you can count those when you aggregate (sketched after this query).
SELECT * INTO TEMPORARY TABLE shifted_values
FROM
(
SELECT
*,
case
when datapoint is not null then datapoint
when datapoint is null and (lag(datapoint,1) over (order by minute_timestamp asc)) is not null
then lag(datapoint,1) over (order by minute_timestamp asc)
when datapoint is null and (lag(datapoint,1) over (order by minute_timestamp asc)) is null and (lag(datapoint,2) over (order by minute_timestamp asc)) is not null
then lag(datapoint,2) over (order by minute_timestamp asc)
else null end as last_good_value
from joined_with_possibles
ORDER BY minute_timestamp asc
) as data;
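--A sketch of that 0/1/2 flag column (my addition, not in the original answer);
--it could be added to the select list of the query above:
--case
--    when datapoint is not null then 0
--    when (lag(datapoint,1) over (order by minute_timestamp asc)) is not null then 1
--    when (lag(datapoint,2) over (order by minute_timestamp asc)) is not null then 2
--    else null
--end as fill_distance
--Counting rows per fill_distance in each bucket shows how much of each average
--is carried-forward rather than observed data.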
--Now we use the flooring function defined above to make the timestamps to aggregate on.
SELECT * INTO TEMPORARY TABLE shifted_values_with_five_minute
FROM
(
SELECT *, dev.beginning_datetime_floor(minute_timestamp,5) as five_minute_timestamp
FROM shifted_values
) as data;
--Finally we aggregate. Note: we average last_good_value (the gap-filled column built above)
--rather than the raw datapoint, so the carried-forward values contribute to the mean.
SELECT
AVG(last_good_value) as avg_datapoint, five_minute_timestamp
FROM shifted_values_with_five_minute
GROUP BY five_minute_timestamp
ORDER BY five_minute_timestamp;
By far the simplest option is to create a reference table. In that table you store the intervals you are interested in:
(Adapt this to your own RDBMS's date notation. The table is named intervals here because INTERVAL is a reserved word in several RDBMSs, including PostgreSQL and MySQL.)
CREATE TABLE intervals (
start_time DATETIME,
cease_time DATETIME
);
INSERT INTO intervals SELECT '2012-10-22 12:00', '2012-10-22 12:05';
INSERT INTO intervals SELECT '2012-10-22 12:05', '2012-10-22 12:10';
INSERT INTO intervals SELECT '2012-10-22 12:10', '2012-10-22 12:15';
INSERT INTO intervals SELECT '2012-10-22 12:15', '2012-10-22 12:20';
INSERT INTO intervals SELECT '2012-10-22 12:20', '2012-10-22 12:25';
INSERT INTO intervals SELECT '2012-10-22 12:25', '2012-10-22 12:30';
INSERT INTO intervals SELECT '2012-10-22 12:30', '2012-10-22 12:35';
INSERT INTO intervals SELECT '2012-10-22 12:35', '2012-10-22 12:40';
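On PostgreSQL (where the columns would be declared as timestamp rather than DATETIME), you could populate the table with generate_series instead of writing the INSERTs by hand. A sketch of mine, not part of the original answer:
INSERT INTO intervals
SELECT ts, ts + interval '5 minutes'
FROM generate_series('2012-10-22 12:00'::timestamp,
                     '2012-10-22 12:35'::timestamp,
                     '5 minutes'::interval) AS ts;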
Then you just join and aggregate...
SELECT
intervals.start_time,
AVG(observation.value)
FROM
intervals
LEFT JOIN
observation
ON observation.timestamp >= intervals.start_time
AND observation.timestamp < intervals.cease_time
GROUP BY
intervals.start_time;
NOTE: You only need to create and populate that intervals table once, then you can re-use it many times.
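If the observation table is large, an index on its timestamp column helps the range join above run efficiently (a general suggestion of mine, not from the original answer):
CREATE INDEX idx_observation_timestamp ON observation (timestamp);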