I came across a problem that forced me to use a loop instead of my preferred dplyr pipe flow.
I want to group rows based on consecutive observations of the same value.
For example, if the first four observations of type equal a, the first four observations should assigned to the same group. Order matters, so I can't dplyr::group_by and dplyr::summarize.
The code below should explain the problem fairly well. I was wondering if anyone could propose a less verbose way to do this, preferably using tidyverse packages, and not data.tables.
library(tidyverse)
# Crete some test data
df <- tibble(
id = 1:20,
type = c(rep("a", 5), rep("b", 5), rep("a", 5), rep("b", 5)),
val = runif(20)
)
df
#> # A tibble: 20 x 3
#> id type val
#> <int> <chr> <dbl>
#> 1 1 a 0.0606
#> 2 2 a 0.501
#> 3 3 a 0.974
#> 4 4 a 0.0833
#> 5 5 a 0.752
#> 6 6 b 0.0450
#> 7 7 b 0.367
#> 8 8 b 0.649
#> 9 9 b 0.846
#> 10 10 b 0.896
#> 11 11 a 0.178
#> 12 12 a 0.295
#> 13 13 a 0.206
#> 14 14 a 0.233
#> 15 15 a 0.851
#> 16 16 b 0.179
#> 17 17 b 0.801
#> 18 18 b 0.326
#> 19 19 b 0.269
#> 20 20 b 0.584
# Solve problem with a loop
count <- 1
df$consec_group <- NA
for (i in 1:nrow(df)) {
current <- df$type[i]
lag <- ifelse(i == 1, NA, df$type[i - 1])
lead <- ifelse(i == nrow(df), NA, df$type[i + 1])
if (lead %>% is.na) {
df$consec_group[i] <- ifelse(current == lag, count, count + 1)
} else {
df$consec_group[i] <- count
if (current != lead) count <- count + 1
}
}
df
#> # A tibble: 20 x 4
#> id type val consec_group
#> <int> <chr> <dbl> <dbl>
#> 1 1 a 0.0606 1
#> 2 2 a 0.501 1
#> 3 3 a 0.974 1
#> 4 4 a 0.0833 1
#> 5 5 a 0.752 1
#> 6 6 b 0.0450 2
#> 7 7 b 0.367 2
#> 8 8 b 0.649 2
#> 9 9 b 0.846 2
#> 10 10 b 0.896 2
#> 11 11 a 0.178 3
#> 12 12 a 0.295 3
#> 13 13 a 0.206 3
#> 14 14 a 0.233 3
#> 15 15 a 0.851 3
#> 16 16 b 0.179 4
#> 17 17 b 0.801 4
#> 18 18 b 0.326 4
#> 19 19 b 0.269 4
#> 20 20 b 0.584 4
Created on 2019-03-14 by the reprex package (v0.2.1)
This grouping of consecutive type occurrences is really just an intermediate step. My endgame is manipulate val for a given consec_group, based on the values of val that occurred within the previous consec_group. Advice on relevant packages would be appreciated.
You say "no data.tables", but are you sure? It's so *** fast and easy (in this case)...
library(data.table)
setDT(df)[, groupid := rleid(type)][]
# id type val groupid
# 1: 1 a 0.624078793 1
# 2: 2 a 0.687361541 1
# 3: 3 a 0.817702740 1
# 4: 4 a 0.669857208 1
# 5: 5 a 0.100977936 1
# 6: 6 b 0.418275823 2
# 7: 7 b 0.660119857 2
# 8: 8 b 0.876015209 2
# 9: 9 b 0.473562143 2
# 10: 10 b 0.284474633 2
# 11: 11 a 0.034154862 3
# 12: 12 a 0.391760387 3
# 13: 13 a 0.383107868 3
# 14: 14 a 0.729583433 3
# 15: 15 a 0.006288375 3
# 16: 16 b 0.530179235 4
# 17: 17 b 0.802643704 4
# 18: 18 b 0.409618633 4
# 19: 19 b 0.309363642 4
# 20: 20 b 0.021918512 4
If you insist on using the tidyverse/dplyr, you can (of course) still use the
rleid-function as follows:
df %>% mutate( groupid = data.table::rleid(type) )
on a larger sample
library(tidyverse)
library(data.table)
# Crete some large test data
df <- tibble(
id = 1:200000,
type = sample(letters[1:26], 200000, replace = TRUE),
val = runif(200000)
)
dt <- as.data.table(df)
microbenchmark::microbenchmark(
dplyr.rleid = df %>% mutate( groupid = data.table::rleid(type) ),
data.table.rleid = dt[, groupid := rleid(type)][],
rle = df %>% mutate(ID_rleid = {ID_rleid = rle(type); rep(seq_along(ID_rleid$lengths), ID_rleid$lengths)}),
rle2 = df %>% mutate(ID_rleid = with(rle(type), rep(seq_along(lengths), lengths))),
transform = transform(df, ID = with(rle(df$type), rep(seq_along(lengths), lengths))),
times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr.rleid 3.153626 3.278049 3.410363 3.444949 3.502792 3.582626 10
# data.table.rleid 2.965639 3.065959 3.173992 3.145643 3.259672 3.507009 10
# rle 13.059774 14.042797 24.364176 26.126176 29.460561 36.874054 10
# rle2 12.641319 13.553846 30.951152 24.698338 34.139786 102.791719 10
# transform 12.330717 22.419128 22.725242 25.532084 26.187634 26.702794 10
You can use a rleid()-like possibility like this:
df %>%
mutate(ID_rleid = {ID_rleid = rle(type); rep(seq_along(ID_rleid$lengths), ID_rleid$lengths)})
id type val ID_rleid
<int> <chr> <dbl> <int>
1 1 a 0.0430 1
2 2 a 0.858 1
3 3 a 0.504 1
4 4 a 0.318 1
5 5 a 0.469 1
6 6 b 0.144 2
7 7 b 0.173 2
8 8 b 0.0706 2
9 9 b 0.958 2
10 10 b 0.557 2
11 11 a 0.358 3
12 12 a 0.973 3
13 13 a 0.982 3
14 14 a 0.177 3
15 15 a 0.599 3
16 16 b 0.627 4
17 17 b 0.454 4
18 18 b 0.682 4
19 19 b 0.690 4
20 20 b 0.713 4
Or a modification (originally proposed by @d.b) that makes it more handy:
df %>%
mutate(ID_rleid = with(rle(type), rep(seq_along(lengths), lengths)))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With