Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create and fill new columns based on range information from two other columns

Tags:

r

tidyverse

I have the following data:

df <- data.frame(group  = c(1, 1, 1, 2, 2, 2),
                 start  = c(2, 2, 2, 7, 7, 7),
                 stop   = c(4, 7, 8, 7, 8, 9),
                 unstop = c(5, 7, 10, 7, 9, 10))

I now want to do the following:

  • Create new columns that have the names "week_1", "week_2" ... "week_10", "week_n".
  • Within each group for the FIRST ROW, I check which weeks the row was "active" in, i.e. it started in week 2, and stopped in week 4, so the row was active in week 2, 3, 4. I now want to populate the respective week columns with a 1.
  • Within each group for ALL OTHER EXCEPT THE LAST ROW, I do the same check, but now populate based on the unstop value of that row and the stop value of the next row.
  • Within each group for the LAST ROW, I do the same check, but now populate based on the range from unstop to 10 (which is the last week in my case).

I have a theoretical way. The problem is that my real-life data has 80k rows (consisting of 60k groups) and I'd need to create ~200 of such week-columns. Even filtering on 10 rows only takes ~30s for the code below.

So I'm looking for a more elegent/smarter/FASTER solution.

Expected outcome:

# A tibble: 6 × 14
# Groups:   group [2]
  group start  stop unstop week_1 week_2 week_3 week_4 week_5 week_6 week_7 week_8 week_9 week_10
  <dbl> <dbl> <dbl>  <dbl>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>   <int>
1     1     2     4      5      0      1      1      1      0      0      0      0      0       0
2     1     2     7      7      0      0      0      0      0      0      1      1      0       0
3     1     2     8     10      0      0      0      0      0      0      0      0      0       1
4     2     7     7      7      0      0      0      0      0      0      1      0      0       0
5     2     7     8      9      0      0      0      0      0      0      0      1      1       0
6     2     7     9     10      0      0      0      0      0      0      0      0      0       1

Below is how I would have approached it generally (of course not with manually defining each row_number. Apart from that, the code is also wrong and does not give the expected 0/1 values. It alsow throws many warnings. And finally, this code already runs a few seconds just for this small test data. It would run one month for my 80k/200col data set.

add_weeks <- as_tibble(as.list(setNames(rep(0L, 10),
                                        paste0("week_", 1:10))))

df |> 
  bind_cols(add_weeks) |> 
  group_by(group) |> 
  mutate(across(num_range("week_", 1:10),
                ~ if_else(row_number() == 1 & str_extract(cur_column(), "\\d+$") %in% start:stop,
                          1L,
                          .)),
         across(num_range("week_", 1:10),
                ~ if_else(row_number() == 2 & str_extract(cur_column(), "\\d+$") %in% unstop:lead(stop),
                          1L,
                          .)),
         across(num_range("week_", 1:10),
                ~ if_else(row_number() == 3 & str_extract(cur_column(), "\\d+$") %in% unstop:10,
                          1L,
                          .)))
like image 804
deschen Avatar asked Sep 12 '25 09:09

deschen


2 Answers

Now tested code. Implementation of strategy described in comment:

I’d make a matrix with names columns and assign with row and col indices. You can then either attach it as a matrix or convert to data frame.

Mat <- matrix(0, nrow(df), 10) # 200 for real case
maxwk <- 10
colnames(Mat) <- paste0("week", 1:maxwk)

# Add extra column that marks condition 
# If there are always exactly 3 row per group just rep(1:3, ngrps)

# Need to define a value for cond that identifies the three possibilities:

df$cond <- rep(1:3, length=nrow(df))  # assume all groups have exactly 3:

for ( r in 1:nrow(df) ) {
          # for first row in group
  if( df$cond[r] == 1){
     Idx <-  paste0("week", df$start[r]:df$stop[r] ) #start:stop
     Mat[r, Idx] <- 1; next}
          # second
  if( df$cond[r] == 2){ 
     Idx <-  paste0("week" , df$stop[r]:df$unstop[r] )#  stop:unstop
     Mat[r, Idx] <- 1; next}
          # third
  if( df$cond[r] == 3){
    Idx  <- paste0("week", df$unstop[r]:maxwk )    # unstop:max
    Mat[r, Idx] <- 1; next}
  }
df
  group start stop unstop cond
1     1     2    4      5    1
2     1     2    7      7    2
3     1     2    8     10    3
4     2     7    6      7    1
5     2     7    8      9    2
6     2     7    9     10    3
> Mat
     week1 week2 week3 week4 week5 week6 week7 week8 week9 week10
[1,]     0     1     1     1     0     0     0     0     0      0
[2,]     0     0     0     0     0     0     1     0     0      0
[3,]     0     0     0     0     0     0     0     0     0      1
[4,]     0     0     0     0     0     1     1     0     0      0
[5,]     0     0     0     0     0     0     0     1     1      0
[6,]     0     0     0     0     0     0     0     0     0      1

You could cbind these.

There might be performance improvements possible. Could use switch(cond, ...) to dispatch to the correct logic rather than the if( cond == .){ ., next} method. This should be much faster than code that uses ifelse or if_else. If you want to see how that's implemented, then endorse the general strategy with a checkmark and I'll spend the time to add the alternate code.

  • ran a benchmark after setting up both methods for 100 week maximum. * The warnings are from the code in the question:
> perf_results <- microbenchmark(
+     first.method    = do_first(df), sec.method=do_second(df), times=10)
There were 50 or more warnings (use warnings() to see the first 50)
> perf_results
Unit: microseconds
         expr         min        lq         mean       median          uq        max neval
 first.method 4385001.123 4416568.8 4581549.9624 4450691.5455 4615753.753 5350416.80    10
   sec.method     146.432     149.6     181.6137     188.2125     193.307     243.47    10

I wanted to see if a switch method of selecting the proper algorithm for a row would improve performance. It did and to a degree that surprised me. The switch function is analogous to the case function in Pascal and many other languages. It has two forms whose behavior is different depending on whether the first argument, EXPR is numeric or character. Here, the "dispatch" version is chosen because the "cond" column is numeric.

do_third= function(df){ Mat <- matrix(0, nrow(df), 100) # 200 for real case
maxwk <- 100
colnames(Mat) <- paste0("week", 1:maxwk)
df$cond <- rep(1:3, length=nrow(df))  # assume all groups have exactly 3: 
for( r in 1:nrow(df)) { switch( df[r,"cond"],      
         { # for first row in each group of 3
     Idx <-  paste0("week", df$start[r]:df$stop[r] ) #start:stop
     Mat[r, Idx] <- 1 }, 
          
          { # second row in group
     Idx <-  paste0("week" , df$stop[r]:df$unstop[r] )#  stop:unstop
     Mat[r, Idx] <- 1 },
          
          {# third
     Idx  <- paste0("week", df$unstop[r]:maxwk )    # unstop:max
     Mat[r, Idx] <- 1 } ) }
   }

New microbenchmark:

perf_results
Unit: nanoseconds
         expr        min         lq         mean     median         uq        max neval cld
 first.method 4304901359 4351893534 4387626725.8 4372151785 4416247096 4543314742    10   b
   sec.method     162803     173855    2588492.1     215309     216878   24081195    10  a 
   third.meth         34         53        610.6        877        940        963    10  a 
like image 130
IRTFM Avatar answered Sep 14 '25 21:09

IRTFM


FWIW, I'm posting my own solution of it. Apparently, adding 200 cols to a 60k data frame based on some conditions is extremely slow. So what I did instead is:

  • Add one chr column with the info about the weeks via str_c.
  • Create a smaller data set that just has the grouping var and this new info.
  • Then use separate_rows on this week_info to get a long format data set.
  • Then use pivot_wider and combine this info with the orginal data set.

Note that this approach works because I didn't mention in my initial post that I actually want to summarize the week info per group. So in the end I want to have one row per group. In the interest of keeping my question simple, I didn't add this to my question.

Having said that, the solution of @IRTFM is still considerably faster by a factor of 3.

df2 <- df |>
  group_by(group) |> 
  mutate(lead_stop = lead(stop, default = 0),
         n_rows = n(),
         row_number = row_number()) |> 
  ungroup() |> 
  rowwise() |> 
  mutate(split_weeks = case_when(n_rows == 1 & row_number == 1 ~ str_c(start:stop, collapse = ","),
                                 n_rows  > 1 & row_number == 1 ~ str_c(c(start:stop, unstop:lead_stop), collapse = ","),
                                 row_number == n_rows          ~ str_c(unstop:10, collapse = ","),
                                 TRUE                          ~ str_c(unstop:lead_stop, collapse = ",")))

df3 <- df2 |> 
  group_by(group) |> 
  summarize(split_weeks = unique(str_c(split_weeks, collapse = ","))) |> 
  separate_rows(split_weeks, sep = ",", convert = TRUE) |>
  distinct() |> 
  mutate(value = 1L) |>
  full_join(y = data.frame(split_weeks = 1:10)) |> 
  pivot_wider(names_from = split_weeks,
              names_prefix = "week_",
              values_from = value,
              values_fill = 0L,
              names_expand = TRUE) |> 
  filter(!is.na(group))

df4 <- df2 |> 
  ungroup() |> 
  select(-split_weeks, -n_rows) |> 
  pivot_wider(names_from = row_number, values_from = -group) |> 
  bind_cols(x = df3 |> select(-group), y = _)
like image 31
deschen Avatar answered Sep 14 '25 23:09

deschen