If there's not a quick 1-3 liner for this in R, I'll definitely just use linux sort
and a short python program using groupby
, so don't bend over backwards trying to get something crazy working. Here's the input data frame:
df_in <- data.frame(
ID = c(1,1,1,1,1,2,2,2,2,2),
weight = c(150,150,151,150,150,170,170,170,171,171),
start_day = c(1,4,7,10,11,5,10,15,20,25),
end_day = c(4,7,10,11,30,10,15,20,25,30)
)
ID weight start_day end_day
1 1 150 1 4
2 1 150 4 7
3 1 151 7 10
4 1 150 10 11
5 1 150 11 30
6 2 170 5 10
7 2 170 10 15
8 2 170 15 20
9 2 171 20 25
10 2 171 25 30
I would like to do some basic aggregation by ID
and weight
, but only when the group is in consecutive rows of df_in
. Specifically, the desired output is
df_desired_out <- data.frame(
ID = c(1,1,1,2,2),
weight = c(150,151,150,170,171),
min_day = c(1,7,10,5,20),
max_day = c(7,10,30,20,30)
)
ID weight min_day max_day
1 1 150 1 7
2 1 151 7 10
3 1 150 10 30
4 2 170 5 20
5 2 171 20 30
This question seems to be extremely close to what I want, but I'm having lots of trouble adapting it for some reason.
In dplyr, I would do this by creating another grouping variable for the consecutive rows. This is what the code cumsum(c(1, diff(weight) != 0)
is doing in the code chunk below. An example of this is also here.
The group creation can be done within group_by
, and then you can proceed accordingly with making any summaries by group.
library(dplyr)
df_in %>%
group_by(ID, group_weight = cumsum(c(1, diff(weight) != 0)), weight) %>%
summarise(start_day = min(start_day), end_day = max(end_day))
Source: local data frame [5 x 5]
Groups: ID, group_weight [?]
ID group_weight weight start_day end_day
(dbl) (dbl) (dbl) (dbl) (dbl)
1 1 1 150 1 7
2 1 2 151 7 10
3 1 3 150 10 30
4 2 4 170 5 20
5 2 5 171 20 30
This approach does leave you with the extra grouping variable in the dataset, which can be removed, if needed, with select(-group_weight)
after ungrouping.
First we combine ID
and weight
. The quick-and-dirty way is using paste:
df_in$id_weight <- paste(df_in$id, df_in$weight, sep='_')
df_in
ID weight start_day end_day id_weight
1 1 150 1 4 1_150
2 1 150 4 7 1_150
3 1 151 7 10 1_151
4 1 150 10 11 1_150
5 1 150 11 30 1_150
6 2 170 5 10 2_170
7 2 170 10 15 2_170
8 2 170 15 20 2_170
9 2 171 20 25 2_171
10 2 171 25 30 2_171
Safer way is to use interaction
or group_indices
: Combine values in 4 columns to a single unique value
We can group consecutively using rle
.
rlel <- rle(df_in$id_weight)$lengths
df_in$group <- unlist(lapply(1:length(rlel), function(i) rep(i, rlel[i])))
df_in
ID weight start_day end_day id_weight group
1 1 150 1 4 1_150 1
2 1 150 4 7 1_150 1
3 1 151 7 10 1_151 2
4 1 150 10 11 1_150 3
5 1 150 11 30 1_150 3
6 2 170 5 10 2_170 4
7 2 170 10 15 2_170 4
8 2 170 15 20 2_170 4
9 2 171 20 25 2_171 5
10 2 171 25 30 2_171 5
Now with the convenient group number we can summarize by group.
df_in %>%
group_by(group) %>%
summarize(id_weight = id_weight[1],
start_day = min(start_day),
end_day = max(end_day))
# A tibble: 5 x 4
group id_weight start_day end_day
<int> <chr> <dbl> <dbl>
1 1 1_150 1 7
2 2 1_151 7 10
3 3 1_150 10 30
4 4 2_170 5 20
5 5 2_171 20 30
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With