
R: for loop creating new columns populated by conditional statement based on the previous column

My [simplified] data looks like this:

id = sample(1:20, 5)
first_active = c(1,1,1,2,3)
week1 = c(1,1,1,0,0)
week2 = c(1,0,0,1,0)
week3 = c(1,0,1,0,1)
week4 = c(1,0,0,0,1)
week5 = c(0,0,0,0,1)

df = data.frame(cbind(id, first_active, week1, week2, week3, week4, week5))

I want to create a for loop that would, in the same data.frame, create columns p1, p2, ... corresponding to the week1, week2, ... columns and populate them as follows:

i) if the corresponding week value is not 0, then "active"

ii) if the value for a given week is 0, then check the previous p-column's status: if p[i-1] == "active", then "lapsed1"

iii) if the value for a given week is 0, then check the previous p-column's status: if p[i-1] == "lapsed[j]", then "lapsed[j+1]"

iv) otherwise, return NA

This would be the solution to the above example (using mutate in dplyr):

df %>%
mutate( p1 = ifelse(week1 != 0, "active", NA),
      p2 = ifelse(week2 !=0, "active", 
                  ifelse(p1 == "active", "lapsed1", NA)),
      p3 = ifelse(week3 !=0, "active", 
                  ifelse(p2 == "lapsed1", "lapsed2",
                  ifelse(p2 == "active", "lapsed1", NA))),
      p4 = ifelse(week4 !=0, "active", 
                  ifelse(p3 == "lapsed2", "lapsed3",
                  ifelse(p3 == "lapsed1", "lapsed2",
                         ifelse(p3 == "active", "lapsed1", NA)))),
      p5 = ifelse(week5 !=0, "active", 
                  ifelse(p4 == "lapsed3", "lapsed4",
                  ifelse(p4 == "lapsed2", "lapsed3",
                         ifelse(p4 == "lapsed1", "lapsed2",
                                ifelse(p4 == "active", "lapsed1", NA)))))
  )


 id first_active week1 week2 week3 week4 week5     p1      p2      p3      p4      p5
  9            1     1     1     1     1     0 active  active  active  active lapsed1
  5            1     1     0     0     0     0 active lapsed1 lapsed2 lapsed3 lapsed4
 14            1     1     0     1     0     0 active lapsed1  active lapsed1 lapsed2
  3            2     0     1     0     0     0   <NA>  active lapsed1 lapsed2 lapsed3
  8            3     0     0     1     1     1   <NA>    <NA>  active  active  active

I want to create a function/for loop that would do it automatically, as my original data has tens of 'week' columns to refer to.

What I managed to get so far is:

df$p1 = ifelse(df$week1 > 0, "active", NA) # initiating the first p-column

for(i in 2:(ncol(df)-2)) { # defining dynamically number of periods

column_to_write = paste0("p", i, sep="") # column to be populated 
prev_column = paste0("p", i-1, sep="") #previous p-column to the one that's being populated
orig_column = paste0("week", i, sep="") #reference 'week' column
j = 1 #initiating 'lapsed' number

df[column_to_write] = ifelse(df[orig_column]> 0, "active", 
                                  ifelse(df[prev_column] == "active", paste("lapsed", j, sep=""), 
                                  ifelse(df[prev_column] == paste0("lapsed", j, sep=""), paste0("lapsed", j=j+1, sep=""), NA)))

}

But this only gives me a maximum value of "lapsed2" and creates new columns called week[i] rather than p[i]:

 id first_active week1 week2 week3 week4 week5     p1   week2   week3   week4   week5
  9            1     1     1     1     1     0 active  active  active  active lapsed1
  5            1     1     0     0     0     0 active lapsed1 lapsed2    <NA>    <NA>
 14            1     1     0     1     0     0 active lapsed1  active lapsed1 lapsed2
  3            2     0     1     0     0     0   <NA>  active lapsed1 lapsed2    <NA>
  8            3     0     0     1     1     1   <NA>    <NA>  active  active  active

How do I change the code so that numbers in "lapsed" values continue to rise beyond 2?

Thanks for your help! Kasia

Kasia Kulma asked Sep 19 '16 14:09

1 Answer

In the end I gave up on the for loop and instead followed the suggestions posted by @Gregor; here's what I did:

library(reshape2) # melt() and dcast()
library(dplyr)

df_long = melt(df, id.vars = c("id", "first_active")) # transformed my wide data to the long format
colnames(df_long) = c("id", "first_active", "week_num", "week_orders")

df_long =
  df_long %>%
  mutate(week_idx = as.numeric(sub("week", "", week_num)), # week number; unlike substr(week_num, 5, 5) this also works for week10 and above
         p_var = paste0("p", week_idx)) %>% # created p-columns that correspond to respective weeks
  arrange(id, week_num) %>%
  group_by(id) %>%
  mutate(active_var = ifelse(week_orders != 0, "active",
                             ifelse(first_active < week_idx, "lapsed", NA_character_))) %>% # returns "active", "lapsed" or NA depending on user activity
  mutate(lapsed_num = sequence(rle(active_var)[["lengths"]]), # counts consecutive occurrences of "lapsed" for a given id; restarts from 1 after each "active"
         final = ifelse(active_var == "active", active_var,
                        ifelse(active_var == "lapsed", paste0(active_var, lapsed_num), NA_character_))) %>% # takes "active" or pastes "lapsed" together with the sequence number
  ungroup() %>%
  select(id, first_active, week_num, week_orders, p_var, final) %>%
  as.data.frame()
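
The counting step relies on base R's rle() and sequence(). As a minimal sketch of how the counter restarts after every change of status, here it is on a single made-up status vector (the vector `status` is hypothetical, not taken from the data above):

```r
# Hypothetical status vector for one id (illustration only)
status = c("active", "lapsed", "lapsed", "active", "lapsed")

# rle() returns the length of each run of identical consecutive values
run_lengths = rle(status)[["lengths"]]

# sequence() counts 1..n within each run, so the count restarts at every change
lapsed_num = sequence(run_lengths)

# append the counter only to the "lapsed" entries
final = ifelse(status == "lapsed", paste0(status, lapsed_num), status)
print(final)
```

Note that rle() treats every NA as its own run, so leading NA weeks do not affect the count.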

In the end, my data looked like this:

head(df_final, 25)
active_var id first_active week_num week_orders p_var   final
     <NA>  3            2    week1           0    p1    <NA>
   active  3            2    week2           1    p2  active
   lapsed  3            2    week3           0    p3 lapsed1
   lapsed  3            2    week4           0    p4 lapsed2
   lapsed  3            2    week5           0    p5 lapsed3
   active  5            1    week1           1    p1  active

So all I needed to do was cast the data.frame back to the wide format (in two steps):

df_weeks = dcast(df_long[, 1:4], id + first_active ~ week_num,  value.var = "week_orders")

df_p = dcast(df_long[, c(1:2, 5:6)], id + first_active ~ p_var,  value.var = "final")

And join them:

df_solution = inner_join(df_weeks, df_p, by = c("id", "first_active"))

Voila!

df_solution
id first_active week1 week2 week3 week4 week5     p1      p2      p3      p4      p5
 3            2     0     1     0     0     0   <NA>  active lapsed1 lapsed2 lapsed3
 5            1     1     0     0     0     0 active lapsed1 lapsed2 lapsed3 lapsed4
 8            3     0     0     1     1     1   <NA>    <NA>  active  active  active
 9            1     1     1     1     1     0 active  active  active  active lapsed1
14            1     1     0     1     0     0 active lapsed1  active lapsed1 lapsed2
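
For anyone who still prefers a loop, here is a minimal base-R sketch (not the approach I used above). Instead of parsing the previous p-column, it keeps a running per-row counter of consecutive zero weeks, and it writes with df[[...]] so the new columns really are named p1, p2, ... The helper name `add_status_columns`, the name `df_loop`, and the assumption that the week columns are called week1, week2, ... without gaps are only for illustration; the ids are the ones from the example output.

```r
# Sketch: add p1, p2, ... to a data.frame that has columns week1, week2, ...
add_status_columns = function(df) {
  n_weeks = sum(grepl("^week[0-9]+$", names(df)))
  lapsed = rep(NA_integer_, nrow(df)) # NA means "not active yet"
  for (i in seq_len(n_weeks)) {
    active = df[[paste0("week", i)]] != 0
    # reset the counter when active, otherwise extend it (NA + 1 stays NA)
    lapsed = ifelse(active, 0L, lapsed + 1L)
    df[[paste0("p", i)]] = ifelse(active, "active",
                                  ifelse(is.na(lapsed), NA_character_,
                                         paste0("lapsed", lapsed)))
  }
  df
}

df_loop = data.frame(id = c(9, 5, 14, 3, 8),
                     first_active = c(1, 1, 1, 2, 3),
                     week1 = c(1, 1, 1, 0, 0),
                     week2 = c(1, 0, 0, 1, 0),
                     week3 = c(1, 0, 1, 0, 1),
                     week4 = c(1, 0, 0, 0, 1),
                     week5 = c(0, 0, 0, 0, 1))
df_loop = add_status_columns(df_loop)
print(df_loop)
```

Because the counter is a plain integer vector, the "lapsed" numbers keep rising for as many week columns as the data has.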

Kasia Kulma answered Nov 17 '22 03:11