I am trying to generate unique ID for each event/episode and then wanted to calculate the mean of a column based on the generated ID.
Here is an example:
Event | Val |
---|---|
A | 3 |
A | 5 |
A | 1 |
B | 8 |
B | 2 |
C | 1 |
C | 3 |
A | 2 |
A | 5 |
First, I would like to create an ID column as below,
Event | Val | ID |
---|---|---|
A | 3 | A1 |
A | 5 | A1 |
A | 1 | A1 |
B | 8 | B1 |
B | 2 | B1 |
C | 1 | C1 |
C | 3 | C1 |
A | 2 | A2 |
A | 5 | A2 |
Second, I would like to calculate the mean of 'Val' based on the generated ID column.
Thanks for your help and suggestions.
I tried 'cumsum' and 'duplicated' function, but that tend to generate continous ID numbers.
A data.table
approach:
library(data.table)
DT <- data.table(
Event = c("A", "A", "A", "B", "B", "C", "C", "A", "A"),
Val = c(3L, 5L, 1L, 8L, 2L, 1L, 3L, 2L, 5L)
)
DT[, ID := rleid(Event)][, ID := rleid(ID), by = Event][, ID := paste0(Event, ID)][, mean := mean(Val), by = ID]
> DT
Event Val ID mean
1: A 3 A1 3.0
2: A 5 A1 3.0
3: A 1 A1 3.0
4: B 8 B1 5.0
5: B 2 B1 5.0
6: C 1 C1 2.0
7: C 3 C1 2.0
8: A 2 A2 3.5
9: A 5 A2 3.5
A similar approach than @ismirsehregal but with dplyr
. Note that consecutive_id
was introduced with dplyr 1.1.0
, the latest dplyr update.
library(dplyr) #1.1.0+
df %>%
mutate(id = consecutive_id(Event)) %>%
mutate(id = paste0(Event, consecutive_id(id)), .by = Event) %>%
mutate(mean = mean(Val), .by = id)
Event Val id mean
1 A 3 A1 3.0
2 A 5 A1 3.0
3 A 1 A1 3.0
4 B 8 B1 5.0
5 B 2 B1 5.0
6 C 1 C1 2.0
7 C 3 C1 2.0
8 A 2 A2 3.5
9 A 5 A2 3.5
And another approach based on cumsum
and non-consecutive row_number
:
df %>%
mutate(id = row_number()) %>%
mutate(id = paste0(Event, cumsum(c(0, diff(id)) != 1)), .by = Event) %>%
mutate(mean = mean(Val), .by = id)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With