Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to repeat sequence when condition is met

Tags:

r

dplyr

I have found variations of this question, and I know modulos can possibly be used but I am having a difficult time putting it all together.

I have a sequence of observations by ID and seconds. When the cumulitive amount of seconds by id increments greater than 5 seconds, I would like to restart the count. Could someone help me answer this question in dplyr?

Original df

df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,3,3,3,3), 
                 val = c(2,10,12,15,17,2,4,7,8,12,15,20,25))

df
   id val
1   1   2
2   1  10
3   1  12
4   1  15
5   1  17
6   2   2
7   2   4
8   2   7
9   2   8
10  3  12
11  3  15
12  3  20
13  3  25

Desired Outcome

finalResult
   id val reset
1   1   2     1
2   1  10     2
3   1  12     2
4   1  15     3
5   1  17     3
6   2   2     1
7   2   4     1
8   2   7     2
9   2   8     2
10  3  12     1
11  3  15     1
12  3  20     2
13  3  25     3

Edit

Thanks for the responses yesterday but I encountered some problems with the given solutions.

On this data set the code works on some instances.

sub.df <- structure(list(`ID` = c("1", 
                                                "1", "1", 
                                                "1", "1", 
                                                "1", "1", 
                                                "1", "1"
), dateFormat = structure(c(1479955726, 1479955726, 1483703713, 
                            1495190809, 1495190809, 1497265079, 1497265079, 1474023059, 1474023061
), class = c("POSIXct", "POSIXt"), tzone = "America/Chicago")), .Names = c("ID", 
                                                                           "dateFormat"), row.names = c(NA, -9L), class = c("tbl_df", "tbl", 
                                                                                                                            "data.frame")) 

Solution Used:

jj <- sub.df %>% 
  group_by(`ID`) %>% 
  arrange(`ID`,`dateFormat`)%>%
  mutate(totalTimeInt = difftime(dateFormat,first(dateFormat),units = 'secs'))%>%
  mutate(totalTimeFormat   = as.numeric(totalTimeInt))%>%
  mutate(reset = cumsum(
    Reduce(
      function(x, y) 
        if (x + y >= 5) 0 
        else x + y, 

        diff(totalTimeFormat), init = 0, accumulate = TRUE
    ) == 0
  ))%>%
  mutate(reset_2 = cumsum(
    accumulate(
      diff(totalTimeFormat), 
      ~if (.x + .y >= 5) 0 else .x + .y, 
      .init = 0
    ) == 0
  ))

Outcome

# A tibble: 9 x 6
# Groups:   ID [1]
     ID          dateFormat  totalTimeInt totalTimeFormat reset reset_2
  <chr>              <dttm>        <time>           <dbl> <int>   <int>
1     1 2016-09-16 05:50:59        0 secs               0     1       1
2     1 2016-09-16 05:51:01        2 secs               2     1       1
3     1 2016-11-23 20:48:46  5932667 secs         5932667     2       2
4     1 2016-11-23 20:48:46  5932667 secs         5932667     3       3
5     1 2017-01-06 05:55:13  9680654 secs         9680654     4       4
6     1 2017-05-19 05:46:49 21167750 secs        21167750     5       5
7     1 2017-05-19 05:46:49 21167750 secs        21167750     6       6
8     1 2017-06-12 05:57:59 23242020 secs        23242020     7       7
9     1 2017-06-12 05:57:59 23242020 secs        23242020     8       8

What happens is that for the first two observation it correctly counts that as 1 instance. When it reaches the third and fourth observation this should only be counted as two observations as there was essentially no time that passed between those two instances.

Correct Output:

# A tibble: 9 x 6
# Groups:   ID [1]
     ID          dateFormat  totalTimeInt totalTimeFormat reset reset_2
  <chr>              <dttm>        <time>           <dbl> <int>   <int>
1     1 2016-09-16 05:50:59        0 secs               0     1       1
2     1 2016-09-16 05:51:01        2 secs               2     1       1
3     1 2016-11-23 20:48:46  5932667 secs         5932667     2       2
4     1 2016-11-23 20:48:46  5932667 secs         5932667     2       2
5     1 2017-01-06 05:55:13  9680654 secs         9680654     3       3
6     1 2017-05-19 05:46:49 21167750 secs        21167750     4       4
7     1 2017-05-19 05:46:49 21167750 secs        21167750     4       4
8     1 2017-06-12 05:57:59 23242020 secs        23242020     5       5
9     1 2017-06-12 05:57:59 23242020 secs        23242020     5       5
like image 841
DataTx Avatar asked Dec 06 '17 18:12

DataTx


Video Answer


1 Answers

If you use Reduce with accumulate = TRUE (or purrr::accumulate, if you prefer), you can reset the running difference when it is greater than or equal to 5. Calling cumsum on whether that total is 0 will return the number of resets.

library(tidyverse)

df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,3,3,3,3), 
                 val = c(2,10,12,15,17,2,4,7,8,12,15,20,25))

df %>% 
    group_by(id) %>% 
    mutate(reset = cumsum(
        Reduce(
            function(x, y) if (x + y >= 5) 0 else x + y, 
            diff(val), init = 0, accumulate = TRUE
        ) == 0
    ))
#> # A tibble: 13 x 3
#> # Groups:   id [3]
#>       id   val reset
#>    <dbl> <dbl> <int>
#>  1     1     2     1
#>  2     1    10     2
#>  3     1    12     2
#>  4     1    15     3
#>  5     1    17     3
#>  6     2     2     1
#>  7     2     4     1
#>  8     2     7     2
#>  9     2     8     2
#> 10     3    12     1
#> 11     3    15     1
#> 12     3    20     2
#> 13     3    25     3

or with purrr::accumulate,

df %>% 
    group_by(id) %>%
    mutate(reset = cumsum(
        accumulate(
            diff(val), 
            ~if (.x + .y >= 5) 0 else .x + .y, 
            .init = 0
        ) == 0
    ))
#> # A tibble: 13 x 3
#> # Groups:   id [3]
#>       id   val reset
#>    <dbl> <dbl> <int>
#>  1     1     2     1
#>  2     1    10     2
#>  3     1    12     2
#>  4     1    15     3
#>  5     1    17     3
#>  6     2     2     1
#>  7     2     4     1
#>  8     2     7     2
#>  9     2     8     2
#> 10     3    12     1
#> 11     3    15     1
#> 12     3    20     2
#> 13     3    25     3

Regarding the edit, the issue is that some of the diffs are 0, which is the same as what it's counting to see resets. The simplest solution is to use NA instead of zero as a reset value:

library(tidyverse)

sub.df <- structure(list(`ID` = c("1", "1", "1", "1", "1", "1", "1", "1", "1"), 
                         dateFormat = structure(c(1479955726, 1479955726, 1483703713, 
                            1495190809, 1495190809, 1497265079, 1497265079, 1474023059, 1474023061), 
                            class = c("POSIXct", "POSIXt"), tzone = "America/Chicago")), 
                    .Names = c("ID", "dateFormat"), row.names = c(NA, -9L), 
                    class = c("tbl_df", "tbl", "data.frame")) 

sub.df %>% 
    group_by(ID) %>% 
    arrange(ID, dateFormat) %>%
    mutate(reset = cumsum(is.na(
               accumulate(diff(dateFormat), 
                          ~{
                              s <- sum(.x, .y, na.rm = TRUE);
                              if (s >= 5) NA else s
                          }, 
                          .init = NA)
    )))
#> # A tibble: 9 x 3
#> # Groups:   ID [1]
#>      ID          dateFormat reset
#>   <chr>              <dttm> <int>
#> 1     1 2016-09-16 05:50:59     1
#> 2     1 2016-09-16 05:51:01     1
#> 3     1 2016-11-23 20:48:46     2
#> 4     1 2016-11-23 20:48:46     2
#> 5     1 2017-01-06 05:55:13     3
#> 6     1 2017-05-19 05:46:49     4
#> 7     1 2017-05-19 05:46:49     4
#> 8     1 2017-06-12 05:57:59     5
#> 9     1 2017-06-12 05:57:59     5

Ultimately this approach faces limitations, too, though, as if any values actually are NA, it will increment similarly. A more robust solution would be to return a list of two elements from each iteration, one for the total with resets, and one for the reset count. This is more work to implement, though:

sub.df %>% 
    group_by(ID) %>% 
    arrange(ID, dateFormat) %>%
    mutate(total_reset = accumulate(
        transpose(list(total = diff(dateFormat), reset = rep(0, n() - 1))),
        ~{
            s <- .x$total + .y$total;
            if (s >= 5) {
                data_frame(total = 0, reset = .x$reset + 1)
            } else {
                data_frame(total = s, reset = .x$reset)
            }
        }, 
        .init = data_frame(total = 0, reset = 1)
    )) %>% 
    unnest()
#> # A tibble: 9 x 4
#> # Groups:   ID [1]
#>      ID          dateFormat total reset
#>   <chr>              <dttm> <dbl> <dbl>
#> 1     1 2016-09-16 05:50:59     0     1
#> 2     1 2016-09-16 05:51:01     2     1
#> 3     1 2016-11-23 20:48:46     0     2
#> 4     1 2016-11-23 20:48:46     0     2
#> 5     1 2017-01-06 05:55:13     0     3
#> 6     1 2017-05-19 05:46:49     0     4
#> 7     1 2017-05-19 05:46:49     0     4
#> 8     1 2017-06-12 05:57:59     0     5
#> 9     1 2017-06-12 05:57:59     0     5

The total looks a little silly, but if you look at the diff, it's actually correct.

like image 82
alistaire Avatar answered Sep 26 '22 21:09

alistaire