I have the following data:
df <- data.frame(group = c(1, 1, 1, 2, 2, 2),
start = c(2, 2, 2, 7, 7, 7),
stop = c(4, 7, 8, 7, 8, 9),
unstop = c(5, 7, 10, 7, 9, 10))
I now want to do the following:
I have a theoretical way. The problem is that my real-life data has 80k rows (consisting of 60k groups) and I'd need to create ~200 of such week-columns. Even filtering on 10 rows only takes ~30s for the code below.
So I'm looking for a more elegent/smarter/FASTER solution.
Expected outcome:
# A tibble: 6 × 14
# Groups: group [2]
group start stop unstop week_1 week_2 week_3 week_4 week_5 week_6 week_7 week_8 week_9 week_10
<dbl> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 2 4 5 0 1 1 1 0 0 0 0 0 0
2 1 2 7 7 0 0 0 0 0 0 1 1 0 0
3 1 2 8 10 0 0 0 0 0 0 0 0 0 1
4 2 7 7 7 0 0 0 0 0 0 1 0 0 0
5 2 7 8 9 0 0 0 0 0 0 0 1 1 0
6 2 7 9 10 0 0 0 0 0 0 0 0 0 1
Below is how I would have approached it generally (of course not with manually defining each row_number. Apart from that, the code is also wrong and does not give the expected 0/1 values. It alsow throws many warnings. And finally, this code already runs a few seconds just for this small test data. It would run one month for my 80k/200col data set.
add_weeks <- as_tibble(as.list(setNames(rep(0L, 10),
paste0("week_", 1:10))))
df |>
bind_cols(add_weeks) |>
group_by(group) |>
mutate(across(num_range("week_", 1:10),
~ if_else(row_number() == 1 & str_extract(cur_column(), "\\d+$") %in% start:stop,
1L,
.)),
across(num_range("week_", 1:10),
~ if_else(row_number() == 2 & str_extract(cur_column(), "\\d+$") %in% unstop:lead(stop),
1L,
.)),
across(num_range("week_", 1:10),
~ if_else(row_number() == 3 & str_extract(cur_column(), "\\d+$") %in% unstop:10,
1L,
.)))
Now tested code. Implementation of strategy described in comment:
I’d make a matrix with names columns and assign with row and col indices. You can then either attach it as a matrix or convert to data frame.
Mat <- matrix(0, nrow(df), 10) # 200 for real case
maxwk <- 10
colnames(Mat) <- paste0("week", 1:maxwk)
# Add extra column that marks condition
# If there are always exactly 3 row per group just rep(1:3, ngrps)
# Need to define a value for cond that identifies the three possibilities:
df$cond <- rep(1:3, length=nrow(df)) # assume all groups have exactly 3:
for ( r in 1:nrow(df) ) {
# for first row in group
if( df$cond[r] == 1){
Idx <- paste0("week", df$start[r]:df$stop[r] ) #start:stop
Mat[r, Idx] <- 1; next}
# second
if( df$cond[r] == 2){
Idx <- paste0("week" , df$stop[r]:df$unstop[r] )# stop:unstop
Mat[r, Idx] <- 1; next}
# third
if( df$cond[r] == 3){
Idx <- paste0("week", df$unstop[r]:maxwk ) # unstop:max
Mat[r, Idx] <- 1; next}
}
df
group start stop unstop cond
1 1 2 4 5 1
2 1 2 7 7 2
3 1 2 8 10 3
4 2 7 6 7 1
5 2 7 8 9 2
6 2 7 9 10 3
> Mat
week1 week2 week3 week4 week5 week6 week7 week8 week9 week10
[1,] 0 1 1 1 0 0 0 0 0 0
[2,] 0 0 0 0 0 0 1 0 0 0
[3,] 0 0 0 0 0 0 0 0 0 1
[4,] 0 0 0 0 0 1 1 0 0 0
[5,] 0 0 0 0 0 0 0 1 1 0
[6,] 0 0 0 0 0 0 0 0 0 1
You could cbind
these.
There might be performance improvements possible. Could use switch(cond, ...)
to dispatch to the correct logic rather than the if( cond == .){ ., next}
method. This should be much faster than code that uses ifelse
or if_else
. If you want to see how that's implemented, then endorse the general strategy with a checkmark and I'll spend the time to add the alternate code.
> perf_results <- microbenchmark(
+ first.method = do_first(df), sec.method=do_second(df), times=10)
There were 50 or more warnings (use warnings() to see the first 50)
> perf_results
Unit: microseconds
expr min lq mean median uq max neval
first.method 4385001.123 4416568.8 4581549.9624 4450691.5455 4615753.753 5350416.80 10
sec.method 146.432 149.6 181.6137 188.2125 193.307 243.47 10
I wanted to see if a switch
method of selecting the proper algorithm for a row would improve performance. It did and to a degree that surprised me. The switch
function is analogous to the case
function in Pascal and many other languages. It has two forms whose behavior is different depending on whether the first argument, EXPR
is numeric or character. Here, the "dispatch" version is chosen because the "cond" column is numeric.
do_third= function(df){ Mat <- matrix(0, nrow(df), 100) # 200 for real case
maxwk <- 100
colnames(Mat) <- paste0("week", 1:maxwk)
df$cond <- rep(1:3, length=nrow(df)) # assume all groups have exactly 3:
for( r in 1:nrow(df)) { switch( df[r,"cond"],
{ # for first row in each group of 3
Idx <- paste0("week", df$start[r]:df$stop[r] ) #start:stop
Mat[r, Idx] <- 1 },
{ # second row in group
Idx <- paste0("week" , df$stop[r]:df$unstop[r] )# stop:unstop
Mat[r, Idx] <- 1 },
{# third
Idx <- paste0("week", df$unstop[r]:maxwk ) # unstop:max
Mat[r, Idx] <- 1 } ) }
}
New microbenchmark:
perf_results
Unit: nanoseconds
expr min lq mean median uq max neval cld
first.method 4304901359 4351893534 4387626725.8 4372151785 4416247096 4543314742 10 b
sec.method 162803 173855 2588492.1 215309 216878 24081195 10 a
third.meth 34 53 610.6 877 940 963 10 a
FWIW, I'm posting my own solution of it. Apparently, adding 200 cols to a 60k data frame based on some conditions is extremely slow. So what I did instead is:
str_c
.separate_rows
on this week_info to get a long format data set.pivot_wider
and combine this info with the orginal data set.Note that this approach works because I didn't mention in my initial post that I actually want to summarize the week info per group. So in the end I want to have one row per group. In the interest of keeping my question simple, I didn't add this to my question.
Having said that, the solution of @IRTFM is still considerably faster by a factor of 3.
df2 <- df |>
group_by(group) |>
mutate(lead_stop = lead(stop, default = 0),
n_rows = n(),
row_number = row_number()) |>
ungroup() |>
rowwise() |>
mutate(split_weeks = case_when(n_rows == 1 & row_number == 1 ~ str_c(start:stop, collapse = ","),
n_rows > 1 & row_number == 1 ~ str_c(c(start:stop, unstop:lead_stop), collapse = ","),
row_number == n_rows ~ str_c(unstop:10, collapse = ","),
TRUE ~ str_c(unstop:lead_stop, collapse = ",")))
df3 <- df2 |>
group_by(group) |>
summarize(split_weeks = unique(str_c(split_weeks, collapse = ","))) |>
separate_rows(split_weeks, sep = ",", convert = TRUE) |>
distinct() |>
mutate(value = 1L) |>
full_join(y = data.frame(split_weeks = 1:10)) |>
pivot_wider(names_from = split_weeks,
names_prefix = "week_",
values_from = value,
values_fill = 0L,
names_expand = TRUE) |>
filter(!is.na(group))
df4 <- df2 |>
ungroup() |>
select(-split_weeks, -n_rows) |>
pivot_wider(names_from = row_number, values_from = -group) |>
bind_cols(x = df3 |> select(-group), y = _)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With