Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fill value backwards from occurence by group with condition

Tags:

dataframe

r

dplyr

Problem: I would like to fill a value backwards from occurrence by group with a condition. I am trying to generate column C in the desired output.

Set C equal to B and fill 1 backwards if A is <= 35, stop fill if A > 35.

I am trying to complete this task using dplyr.

Building on something similar to my previous question: Fill value backwards from occurence by group

Input:

DAT_in  = data.frame(ID=c(1,1,1,1,
                       2,2,2,
                       3,3,3,
                       4,4,4,4,4), 
                  time=c(1,2,3,4,
                         1,2,3,
                         1,2,3,
                         1,2,3,4,5),
                  A=c(100,35,25,0,
                      100,75,55,
                      100,28,25,
                      100,30,45,25,0),
                  B=c(0,0,0,1,
                      0,0,0,
                      0,0,1,
                      0,0,0,0,1))

Desired output (C):

DAT_out  = data.frame(ID=c(1,1,1,1,
                   2,2,2,
                   3,3,3,
                   4,4,4,4,4), 
              time=c(1,2,3,4,
                     1,2,3,
                     1,2,3,
                     1,2,3,4,5),
              A=c(100,35,25,0,
                  100,75,55,
                  100,28,25,
                  100,30,45,25,0),
              B=c(0,0,0,1,
                  0,0,0,
                  0,0,1,
                  0,0,0,0,1),
              C=c(0,1,1,1,
                  0,0,0,
                  0,1,1,
                  0,0,0,1,1))
like image 987
BEMR Avatar asked Aug 08 '17 16:08

BEMR


3 Answers

Initially this question was aimed at dplyr so here's the dplyr solution I came up with. It's much less elegant than Frank's solution, but I already did it so why not write it down "for posterity".

Dat_out_step1 <- DAT_in %>%
  group_by(ID) %>%
  filter(B==1) %>%
  select(-A,-B) %>%
  summarize(max.time = min(time)) %>%
  full_join(DAT_in, by = "ID")


Dat_out_step2 <- Dat_out_step1 %>%
  group_by(ID) %>% 
  filter(A>35 & time < max.time) %>%
  select(-A, -B, -max.time) %>%
  group_by(ID) %>%
  summarize(min.time = max(time))

DAT_out_step3 <- Dat_out_step1 %>%
  left_join(Dat_out_step2) %>%
  mutate(C = ifelse(is.na(max.time), 0,
                    (time > min.time & time <= max.time)*1)) %>%
  select(-max.time, -min.time)

EDIT: To Frank's suggestion, instead of using ifelse() you can use (in the last block):

DAT_out_step3 <- Dat_out_step1 %>%
  left_join(Dat_out_step2) %>%
  mutate(C = replace((time > min.time & time <= max.time)*1 ,is.na(max.time), 0)) %>%
  select(-max.time, -min.time)

Old ifelse() habits die hard... Thanks Frank for spotting that one out.

like image 29
Adi Sarid Avatar answered Sep 24 '22 19:09

Adi Sarid


If you are looking for dplyr solution, will this work?

DAT_in2 <- DAT_in %>%
  mutate(C = ifelse(A <= 35 & lead(A) <= 35, 1, B)) %>%
  mutate(C = ifelse(row_number() == n(), B, C))

# Check if DAT_in2 and DAT_out are the same
identical(DAT_in2, DAT_out)
[1] TRUE
like image 26
www Avatar answered Sep 22 '22 19:09

www


This seems to work:

library(data.table)
setDT(DAT_in)

DAT_in[order(ID, -time), C := as.integer(cumsum(A > 35) == 0L), by=ID][]

all.equal(DAT_in$C, DAT_out$C) # TRUE

How it works

  • Sort backwards from the final period.
  • Mark as 1 until A > 35.

DT[i, v := ..., by=g] only orders by i during the assignment to column v; the initial sort order of DT is preserved.

as.integer coerces TRUE to 1; FALSE to 0.

like image 179
Frank Avatar answered Sep 26 '22 19:09

Frank