
Elegant way to sum values by time intervals (whilst accounting for missing values)

Tags:

r

I'm trying to take something like this

df <- data.frame(times = c("0915", "0930", "0945", "1000", "1015", "1030", "1045", "1100",
                           "1130", "1145", "1200", "1215", "1245", "1300", "1330", "1345"),
                 values = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 3, 4, 1, 3, 4, 2, 4))

> df
   times values
1   0915      1
2   0930      2
3   0945      3
4   1000      4
5   1015      1
6   1030      2
7   1045      3
8   1100      4
9   1130      1
10  1145      3
11  1200      4
12  1215      1
13  1245      3
14  1300      4
15  1330      2
16  1345      4

And turn it into something like this

> df2
  times values
1   0930      3
2   1000      7
3   1030      3
4   1100      7
5   1130     NA
6   1200      7
7   1230     NA
8   1300      7
9   1330     NA
10  1400     NA

Essentially, take values measured in 15 minute intervals, and convert them into values measured across 30 minute intervals (summing is sufficient for this).

I can think of an okay solution if I can be certain I have two 15-minute readings for each half-hourly value: just add elements pairwise. But I can't be certain of that in my data set, and as my demo shows, there can be multiple consecutive missing values.

So I thought some kind of time matching was necessary, e.g. recognise that a reading falls between 9:15 and 9:30 and sum just those two. I already have a function called hr2dec, which I created to convert these times to decimal, so it looks like this

> hr2dec(df$times)
 [1]  9.25  9.50  9.75 10.00 10.25 10.50 10.75 11.00 11.50 11.75 12.00 12.25
[13] 12.75 13.00 13.50 13.75

I mention this in case it's easier to solve this problem with decimal instead of 4 digit time.
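(The body of hr2dec isn't shown in the question; a one-liner that reproduces the output above — my guess at the implementation, not the original — could be:)

```r
# Convert "HHMM" strings to decimal hours, e.g. "0915" -> 9.25
hr2dec <- function(x) {
  as.numeric(substr(x, 1, 2)) + as.numeric(substr(x, 3, 4)) / 60
}

hr2dec(c("0915", "1030", "2400"))
# [1]  9.25 10.50 24.00
```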

I also have this data for 24 hours, and multiple days. So if I have a solution that loops, it would need to reset to 0015 after 2400, as these are the first and last measurements for each day. A full set of data with dates included could be generated like so (with decimals for times, like I said, either is fine for me):

set.seed(42)
full_df <- data.frame(date = rep(as.Date(c("2010-02-02", "2010-02-03")), each = 96),
                      dec_times = seq(0.25,24,0.25),
                      values = rnorm(192)
                      )

full_df <- full_df[-c(2,13,15,19,95,131,192),]

The best solution I can come up with so far is a pairwise comparative loop. But even this is not perfect.

Is there an elegant way to do what I'm after? I.e. check the first and last values (in terms of date and time) and sum each half-hourly interval? I'm not satisfied with my loop, which:

  1. Checks the first and last date-time values to work out the range of half hours.
  2. Walks through the items in order, a pair at a time, to decide whether both values belonging to that half-hourly period are present.
  3. Sums them if so, and places NA if not.
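The steps above can be sketched in base R without an explicit loop. This is my own rough take, not code from the question: the function name and the "exactly two readings" check are my assumptions, and times are taken as decimal hours marking the end of each 15-minute reading.

```r
half_hour_sums <- function(times_dec, values) {
  # Each reading belongs to the half hour that ends at or after it:
  # 9.25 and 9.50 both roll up to 9.5; 9.75 and 10.00 to 10; etc.
  ends <- ceiling(times_dec * 2) / 2

  # Complete grid of half-hour endpoints across the observed range
  grid <- seq(min(ends), max(ends), by = 0.5)

  sums   <- tapply(values, ends, sum)
  counts <- tapply(values, ends, length)

  # Start with NA everywhere, then fill in only the complete pairs
  out <- setNames(rep(NA_real_, length(grid)), grid)
  ok  <- names(sums)[counts == 2]
  out[ok] <- sums[ok]
  out
}

times  <- c(9.25, 9.50, 9.75, 10.00, 10.25, 10.50, 10.75, 11.00,
            11.50, 11.75, 12.00, 12.25, 12.75, 13.00, 13.50, 13.75)
values <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 3, 4, 1, 3, 4, 2, 4)
half_hour_sums(times, values)
# 9.5 = 3, 10 = 7, 10.5 = 3, 11 = 7, 11.5 = NA,
# 12 = 7, 12.5 = NA, 13 = 7, 13.5 = NA, 14 = NA
```

This matches df2 above, with the decimal endpoints standing in for the 4-digit times. Extending it across days would mean grouping by date first (e.g. with split()), which is left out here.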
LachlanO asked Jan 31 '26

1 Answer

You should check out the tibbletime package -- specifically, you'll want to look at collapse_by(), which collapses a tbl_time object by a time period.

library(tibbletime)
library(dplyr)

# create a series of 7 days
# 2018-01-01 to 2018-01-07 by 15 minute intervals
df <- create_series('2018-01-01' ~ '2018-01-07', period = "15 minute")
df$values <- rnorm(nrow(df))
df
#> # A time tibble: 672 x 2
#> # Index: date
#>    date                 values
#>    <dttm>                <dbl>
#>  1 2018-01-01 00:00:00 -0.365 
#>  2 2018-01-01 00:15:00 -0.275 
#>  3 2018-01-01 00:30:00 -1.50  
#>  4 2018-01-01 00:45:00 -1.64  
#>  5 2018-01-01 01:00:00 -0.341 
#>  6 2018-01-01 01:15:00 -1.05  
#>  7 2018-01-01 01:30:00 -0.544 
#>  8 2018-01-01 01:45:00 -1.10  
#>  9 2018-01-01 02:00:00  0.0824
#> 10 2018-01-01 02:15:00  0.477 
#> # ... with 662 more rows

# Collapse into 30 minute intervals, group, and sum
df %>% 
  collapse_by("30 minute") %>%
  group_by(date) %>%
  summarise(sum_values = sum(values))
#> # A time tibble: 336 x 2
#> # Index: date
#>    date                sum_values
#>    <dttm>                   <dbl>
#>  1 2018-01-01 00:15:00     -0.640
#>  2 2018-01-01 00:45:00     -3.14 
#>  3 2018-01-01 01:15:00     -1.39 
#>  4 2018-01-01 01:45:00     -1.64 
#>  5 2018-01-01 02:15:00      0.559
#>  6 2018-01-01 02:45:00      0.581
#>  7 2018-01-01 03:15:00     -1.50 
#>  8 2018-01-01 03:45:00      1.36 
#>  9 2018-01-01 04:15:00      0.872
#> 10 2018-01-01 04:45:00     -0.835
#> # ... with 326 more rows

# Alternatively, you can use clean = TRUE
df %>% 
  collapse_by("30 minute", clean = TRUE) %>%
  group_by(date) %>%
  summarise(sum_values = sum(values))
#> # A time tibble: 336 x 2
#> # Index: date
#>    date                sum_values
#>    <dttm>                   <dbl>
#>  1 2018-01-01 00:30:00     -0.640
#>  2 2018-01-01 01:00:00     -3.14 
#>  3 2018-01-01 01:30:00     -1.39 
#>  4 2018-01-01 02:00:00     -1.64 
#>  5 2018-01-01 02:30:00      0.559
#>  6 2018-01-01 03:00:00      0.581
#>  7 2018-01-01 03:30:00     -1.50 
#>  8 2018-01-01 04:00:00      1.36 
#>  9 2018-01-01 04:30:00      0.872
#> 10 2018-01-01 05:00:00     -0.835
#> # ... with 326 more rows
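Note that collapse_by() only aggregates rows that exist, so a half hour with no readings simply drops out rather than appearing as NA. If you need the NA rows from the question, one option (my addition, not part of the original answer) is to expand the summarised result against a full sequence of half-hour stamps with tidyr::complete(). Here I knock two rows out of the series first so there is actually a gap to fill:

```r
library(tibbletime)
library(dplyr)
library(tidyr)

df <- create_series('2018-01-01' ~ '2018-01-07', period = "15 minute")
df$values <- rnorm(nrow(df))
df <- df %>% slice(-(3:4))   # drop 00:30 and 00:45 to create a gap

df %>%
  collapse_by("30 minute", clean = TRUE) %>%
  group_by(date) %>%
  summarise(sum_values = sum(values)) %>%
  complete(date = seq(min(date), max(date), by = "30 min"))
# The empty half hour (01:00:00) now appears with sum_values = NA
```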

If you're more into videos (under 20 minutes), check out The Future of Time Series and Financial Analysis in the Tidyverse by Davis Vaughan.

JasonAizkalns answered Feb 03 '26