Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Identify groups of n consecutive numbers in a data.table field in a group

Tags:

r

data.table

This data.table shows the months of year attended by students.

DT = data.table(
 Student = c(1, 1, 1, 1, 1, 1, 1, 1, 1,
             2, 2, 2, 2, 2, 2, 2, 2,
             3, 3, 3, 3, 3, 3, 3, 3),
 Month   = c(1, 2, 3, 5, 6, 7, 8, 11, 12,
             2, 3, 4, 5, 7, 8, 9, 10,
             1, 2, 3, 5, 6, 7, 8, 9))

DT
    Student Month
 1:       1     1
 2:       1     2
 3:       1     3
 4:       1     5
 5:       1     6
 6:       1     7
 7:       1     8
 8:       1    11
 9:       1    12
10:       2     2
11:       2     3
12:       2     4
13:       2     5
14:       2     7
15:       2     8
16:       2     9
17:       2    10
18:       3     1
19:       3     2
20:       3     3
21:       3     5
22:       3     6
23:       3     7
24:       3     8
25:       3     9

I want to identify periods of three consecutive months (identified by the first month in the period). This is visualization of the data table and the eligible periods.

       1   2   3   4   5   6   7   8   9   10  11  12


1      *   *   *       *   *   *   *           *   *
       [-------]       [-------]
                           [-------]                           


2          *   *   *   *       *   *   *   *
           [-------]           [-------]
               [-------]           [-------]


3      *   *   *       *   *   *   *   *      
       [-------]       [-------]
                           [-------]
                               [-------]

Desired output:

id   First_month_in_the_period 

1    1
1    5
1    6
2    2
2    3
2    7
2    8
3    1
3    5
3    6
3    7

Looking for data.table (or dplyr) solutions.

like image 212
Orion Avatar asked Nov 29 '22 21:11

Orion


1 Answers

Use standard method (cumsum...diff...condition) to identify runs of consecutive values, which then is used as grouping variable together with 'Student'. Within each group, create sequence based on length of each run and add to first month.

DT[ , .(start = if(.N >= 3) Month[1] + 0:(.N - 3)),
    by = .(Student, r = cumsum(c(1L, diff(Month) > 1)))]
#     Student r start
#  1:       1 1     1
#  2:       1 2     5
#  3:       1 2     6
#  4:       2 3     2
#  5:       2 3     3
#  6:       2 4     7
#  7:       2 4     8
#  8:       3 4     1
#  9:       3 5     5
# 10:       3 5     6
# 11:       3 5     7

Equivalent dplyr alternative:

DT %>% 
  group_by(Student, r = cumsum(c(1L, diff(Month) > 1))) %>%
  summarise(list(data.frame(start = if(n() >= 3) Month[1] + 0:(n() - 3)))) %>%
  tidyr::unnest()

# # A tibble: 11 x 3
# # Groups:   Student [3]
#       Student     r start
#         <dbl> <int> <dbl>
#     1       1     1     1
#     2       1     2     5
#     3       1     2     6
#     4       2     3     2
#     5       2     3     3
#     6       2     4     7
#     7       2     4     8
#     8       3     4     1
#     9       3     5     5
#    10       3     5     6
#    11       3     5     7
like image 153
Henrik Avatar answered Dec 15 '22 19:12

Henrik