
R conditional grouping of rows and numbering of groups

Tags:

r

grouping

I work with data frames of flight movements (~1 million rows * 108 variables) and want to group phases during which a certain criterion is met (i.e. a certain variable reaches a given value). To identify these groups, I want to number them. Being an R newbie, I made it work for my case, but now I am looking for a more elegant way. In particular, I would like to get rid of the "useless" gaps in the numbering of the groups. Below is a simplified example of my dplyr data frame, with the value THR for the threshold criterion. The rows are sorted by timestamp (so I omit the timestamp column here).

library(dplyr)

THR <- c(13,17,19,22,21,19,17,12,12,17,20,20,20,17,17,13, 20,20,17,13)
df  <- as.data.frame(THR)
df  <- tbl_df(df)  # deprecated in current dplyr; as_tibble() is the replacement

To flag all rows where the criterion is (not) met:

df  <- mutate(df, CRIT = THR < 19)

With the following, I managed to conditionally "cumsum" to get a unique group identification:

df <- mutate(df, GRP = ifelse(CRIT == 1, 0, cumsum(CRIT)))
df
   THR CRIT GRP
1  13 TRUE   0
2  17 TRUE   0
3  19 FALSE  2          
4  22 FALSE  2
5  21 FALSE  2
6  19 FALSE  2
7  17 TRUE   0
8  12 TRUE   0
9  12 TRUE   0
10 17 TRUE   0
11 20 FALSE  6
12 20 FALSE  6

While this does the trick and I can operate on the groups with group_by (e.g. summarise, filter), the numbering is not ideal, as the example output shows: the 1st group is numbered 2 and the 2nd group is numbered 6, because each group simply inherits the running cumsum() value.

I would appreciate it if anybody could shed some light on this. I was not able to find an appropriate solution in other posts.

Rainer asked Nov 09 '22 05:11


1 Answer

I don't think you can really avoid the preliminary step of creating CRIT, though I'd suggest applying cumsum when creating it and then running a simple cumsum/diff over it. Also, if you don't need the groups that don't meet the criterion, it is better to assign NA rather than an arbitrary number such as zero. Here's a possible data.table solution (also, you don't need the df <- tbl_df(df) step at all):

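library(data.table)
# CRIT is a running count of rows below the threshold; it stays constant within each run of THR >= 19
setDT(df)[, CRIT := cumsum(THR < 19)]
# Among the rows that meet the criterion, start a new group whenever CRIT changes
df[THR >= 19, GRP := cumsum(c(0L, diff(CRIT)) != 0L) + 1L]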
#     THR CRIT GRP
#  1:  13    1  NA
#  2:  17    2  NA
#  3:  19    2   1
#  4:  22    2   1
#  5:  21    2   1
#  6:  19    2   1
#  7:  17    3  NA
#  8:  12    4  NA
#  9:  12    5  NA
# 10:  17    6  NA
# 11:  20    6   2
# 12:  20    6   2
# 13:  20    6   2
# 14:  17    7  NA
# 15:  17    8  NA
# 16:  13    9  NA
# 17:  20    9   3
# 18:  20    9   3
# 19:  17   10  NA
# 20:  13   11  NA
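
If you'd rather not pull in data.table, the same gap-free numbering can probably be had with base R's rle(). This is only a sketch under the question's assumptions (criterion met when THR >= 19, rows already sorted by timestamp); the names met and runs are mine, not from the question.

```r
THR <- c(13, 17, 19, 22, 21, 19, 17, 12, 12, 17, 20, 20, 20, 17, 17, 13, 20, 20, 17, 13)
df  <- data.frame(THR = THR)

met  <- df$THR >= 19   # TRUE where the criterion is met (assumed threshold)
runs <- rle(met)       # collapse consecutive equal flags into runs

# Number the TRUE runs 1, 2, 3, ... and mark the FALSE runs as NA
runs$values <- ifelse(runs$values, cumsum(runs$values), NA_integer_)

# Expand the per-run labels back to one label per row
df$GRP <- inverse.rle(runs)
```

The resulting GRP column should match the one above, so group_by(GRP) works on it as before (filter out the NA rows first if you don't want them summarised as a group of their own).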
David Arenburg answered Nov 15 '22 07:11