I have found variations of this question, and I know modulos can possibly be used but I am having a difficult time putting it all together.
I have a sequence of observations by ID and seconds. When the cumulitive amount of seconds by id increments greater than 5 seconds, I would like to restart the count. Could someone help me answer this question in dplyr?
Original df
df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,3,3,3,3),
val = c(2,10,12,15,17,2,4,7,8,12,15,20,25))
df
id val
1 1 2
2 1 10
3 1 12
4 1 15
5 1 17
6 2 2
7 2 4
8 2 7
9 2 8
10 3 12
11 3 15
12 3 20
13 3 25
Desired Outcome
finalResult
id val reset
1 1 2 1
2 1 10 2
3 1 12 2
4 1 15 3
5 1 17 3
6 2 2 1
7 2 4 1
8 2 7 2
9 2 8 2
10 3 12 1
11 3 15 1
12 3 20 2
13 3 25 3
Edit
Thanks for the responses yesterday but I encountered some problems with the given solutions.
On this data set the code works on some instances.
sub.df <- structure(list(`ID` = c("1",
"1", "1",
"1", "1",
"1", "1",
"1", "1"
), dateFormat = structure(c(1479955726, 1479955726, 1483703713,
1495190809, 1495190809, 1497265079, 1497265079, 1474023059, 1474023061
), class = c("POSIXct", "POSIXt"), tzone = "America/Chicago")), .Names = c("ID",
"dateFormat"), row.names = c(NA, -9L), class = c("tbl_df", "tbl",
"data.frame"))
Solution Used:
jj <- sub.df %>%
group_by(`ID`) %>%
arrange(`ID`,`dateFormat`)%>%
mutate(totalTimeInt = difftime(dateFormat,first(dateFormat),units = 'secs'))%>%
mutate(totalTimeFormat = as.numeric(totalTimeInt))%>%
mutate(reset = cumsum(
Reduce(
function(x, y)
if (x + y >= 5) 0
else x + y,
diff(totalTimeFormat), init = 0, accumulate = TRUE
) == 0
))%>%
mutate(reset_2 = cumsum(
accumulate(
diff(totalTimeFormat),
~if (.x + .y >= 5) 0 else .x + .y,
.init = 0
) == 0
))
Outcome
# A tibble: 9 x 6
# Groups: ID [1]
ID dateFormat totalTimeInt totalTimeFormat reset reset_2
<chr> <dttm> <time> <dbl> <int> <int>
1 1 2016-09-16 05:50:59 0 secs 0 1 1
2 1 2016-09-16 05:51:01 2 secs 2 1 1
3 1 2016-11-23 20:48:46 5932667 secs 5932667 2 2
4 1 2016-11-23 20:48:46 5932667 secs 5932667 3 3
5 1 2017-01-06 05:55:13 9680654 secs 9680654 4 4
6 1 2017-05-19 05:46:49 21167750 secs 21167750 5 5
7 1 2017-05-19 05:46:49 21167750 secs 21167750 6 6
8 1 2017-06-12 05:57:59 23242020 secs 23242020 7 7
9 1 2017-06-12 05:57:59 23242020 secs 23242020 8 8
What happens is that for the first two observation it correctly counts that as 1 instance. When it reaches the third and fourth observation this should only be counted as two observations as there was essentially no time that passed between those two instances.
Correct Output:
# A tibble: 9 x 6
# Groups: ID [1]
ID dateFormat totalTimeInt totalTimeFormat reset reset_2
<chr> <dttm> <time> <dbl> <int> <int>
1 1 2016-09-16 05:50:59 0 secs 0 1 1
2 1 2016-09-16 05:51:01 2 secs 2 1 1
3 1 2016-11-23 20:48:46 5932667 secs 5932667 2 2
4 1 2016-11-23 20:48:46 5932667 secs 5932667 2 2
5 1 2017-01-06 05:55:13 9680654 secs 9680654 3 3
6 1 2017-05-19 05:46:49 21167750 secs 21167750 4 4
7 1 2017-05-19 05:46:49 21167750 secs 21167750 4 4
8 1 2017-06-12 05:57:59 23242020 secs 23242020 5 5
9 1 2017-06-12 05:57:59 23242020 secs 23242020 5 5
If you use Reduce
with accumulate = TRUE
(or purrr::accumulate
, if you prefer), you can reset the running difference when it is greater than or equal to 5. Calling cumsum
on whether that total is 0 will return the number of resets.
library(tidyverse)
df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,3,3,3,3),
val = c(2,10,12,15,17,2,4,7,8,12,15,20,25))
df %>%
group_by(id) %>%
mutate(reset = cumsum(
Reduce(
function(x, y) if (x + y >= 5) 0 else x + y,
diff(val), init = 0, accumulate = TRUE
) == 0
))
#> # A tibble: 13 x 3
#> # Groups: id [3]
#> id val reset
#> <dbl> <dbl> <int>
#> 1 1 2 1
#> 2 1 10 2
#> 3 1 12 2
#> 4 1 15 3
#> 5 1 17 3
#> 6 2 2 1
#> 7 2 4 1
#> 8 2 7 2
#> 9 2 8 2
#> 10 3 12 1
#> 11 3 15 1
#> 12 3 20 2
#> 13 3 25 3
or with purrr::accumulate
,
df %>%
group_by(id) %>%
mutate(reset = cumsum(
accumulate(
diff(val),
~if (.x + .y >= 5) 0 else .x + .y,
.init = 0
) == 0
))
#> # A tibble: 13 x 3
#> # Groups: id [3]
#> id val reset
#> <dbl> <dbl> <int>
#> 1 1 2 1
#> 2 1 10 2
#> 3 1 12 2
#> 4 1 15 3
#> 5 1 17 3
#> 6 2 2 1
#> 7 2 4 1
#> 8 2 7 2
#> 9 2 8 2
#> 10 3 12 1
#> 11 3 15 1
#> 12 3 20 2
#> 13 3 25 3
Regarding the edit, the issue is that some of the diffs are 0, which is the same as what it's counting to see resets. The simplest solution is to use NA
instead of zero as a reset value:
library(tidyverse)
sub.df <- structure(list(`ID` = c("1", "1", "1", "1", "1", "1", "1", "1", "1"),
dateFormat = structure(c(1479955726, 1479955726, 1483703713,
1495190809, 1495190809, 1497265079, 1497265079, 1474023059, 1474023061),
class = c("POSIXct", "POSIXt"), tzone = "America/Chicago")),
.Names = c("ID", "dateFormat"), row.names = c(NA, -9L),
class = c("tbl_df", "tbl", "data.frame"))
sub.df %>%
group_by(ID) %>%
arrange(ID, dateFormat) %>%
mutate(reset = cumsum(is.na(
accumulate(diff(dateFormat),
~{
s <- sum(.x, .y, na.rm = TRUE);
if (s >= 5) NA else s
},
.init = NA)
)))
#> # A tibble: 9 x 3
#> # Groups: ID [1]
#> ID dateFormat reset
#> <chr> <dttm> <int>
#> 1 1 2016-09-16 05:50:59 1
#> 2 1 2016-09-16 05:51:01 1
#> 3 1 2016-11-23 20:48:46 2
#> 4 1 2016-11-23 20:48:46 2
#> 5 1 2017-01-06 05:55:13 3
#> 6 1 2017-05-19 05:46:49 4
#> 7 1 2017-05-19 05:46:49 4
#> 8 1 2017-06-12 05:57:59 5
#> 9 1 2017-06-12 05:57:59 5
Ultimately this approach faces limitations, too, though, as if any values actually are NA
, it will increment similarly. A more robust solution would be to return a list of two elements from each iteration, one for the total with resets, and one for the reset count. This is more work to implement, though:
sub.df %>%
group_by(ID) %>%
arrange(ID, dateFormat) %>%
mutate(total_reset = accumulate(
transpose(list(total = diff(dateFormat), reset = rep(0, n() - 1))),
~{
s <- .x$total + .y$total;
if (s >= 5) {
data_frame(total = 0, reset = .x$reset + 1)
} else {
data_frame(total = s, reset = .x$reset)
}
},
.init = data_frame(total = 0, reset = 1)
)) %>%
unnest()
#> # A tibble: 9 x 4
#> # Groups: ID [1]
#> ID dateFormat total reset
#> <chr> <dttm> <dbl> <dbl>
#> 1 1 2016-09-16 05:50:59 0 1
#> 2 1 2016-09-16 05:51:01 2 1
#> 3 1 2016-11-23 20:48:46 0 2
#> 4 1 2016-11-23 20:48:46 0 2
#> 5 1 2017-01-06 05:55:13 0 3
#> 6 1 2017-05-19 05:46:49 0 4
#> 7 1 2017-05-19 05:46:49 0 4
#> 8 1 2017-06-12 05:57:59 0 5
#> 9 1 2017-06-12 05:57:59 0 5
The total looks a little silly, but if you look at the diff, it's actually correct.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With