How to compare with values adjacent in a sequence in the same group

Tags: r

Let's say I have something like this:

set.seed(0)
the.df <- data.frame( x=rep(letters[1:3], each=4),
                        n=rep(0:3, 3),
                        val=round(runif(12)))
the.df


   x n val
1  a 0   1
2  a 1   0
3  a 2   0
4  a 3   1
5  b 0   1
6  b 1   0
7  b 2   1
8  b 3   1
9  c 0   1
10 c 1   1
11 c 2   0
12 c 3   0

Within each x, starting from n==2 (going from small to large), I want to set val to 0 if the previous val (in terms of n) is 0; otherwise, leave it as is.

For example, in the subset x=="b", I first ignore the two rows where n < 2. Now, in Row 7, because the previous val is 0 (the.df$val[the.df$x=="b" & the.df$n==1]), I set val to 0 (the.df$val[the.df$x=="b" & the.df$n==2] <- 0). Then on Row 8, now that val for the previous n is 0 (we just set it), I also want to set val here to 0 (the.df$val[the.df$x=="b" & the.df$n==3] <- 0).

Imagine that the data.frame is not sorted. Therefore procedures that depend on the order would require a sort. I also can't assume that adjacent rows exist (e.g., the row the.df[the.df$x=="a" & the.df$n==1, ] might be missing).

The trickiest part seems to be evaluating val in sequence. I can do this using a loop but I imagine that it would be inefficient (I have millions of rows). Is there a way I can do this more efficiently?

EDIT: wanted output

the.df

   x n val wanted
1  a 0   1      1
2  a 1   0      0
3  a 2   0      0
4  a 3   1      0
5  b 0   1      1
6  b 1   0      0
7  b 2   1      0
8  b 3   1      0
9  c 0   1      1
10 c 1   1      1
11 c 2   0      0
12 c 3   0      0

Also, I don't mind making new columns (e.g., putting the wanted values there).
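
For reference, the rule described above can be written as a plain loop. This is only a minimal sketch for checking results on small data, not an efficient solution. `apply_rule_loop` is a hypothetical helper name, the data frame is typed in from the table above, and the sketch assumes that when the row for n - 1 is missing within a group, val is left as is (the question does not say what should happen in that case).

```r
# Data typed in from the printed table, so the example does not depend on the RNG
the.df <- data.frame(x = rep(letters[1:3], each = 4),
                     n = rep(0:3, 3),
                     val = c(1, 0, 0, 1,  1, 0, 1, 1,  1, 1, 0, 0))

# Hypothetical helper: slow reference implementation of the rule.
# Assumption: if the row for n - 1 is missing in a group, val is left as is.
apply_rule_loop <- function(df) {
  df$wanted <- df$val
  # Visit rows from small n to large n, so a row always sees the
  # already-updated value of its predecessor; no prior sort is needed
  for (i in order(df$n)) {
    if (df$n[i] < 2) next
    prev <- which(df$x == df$x[i] & df$n == df$n[i] - 1)
    if (length(prev) == 1 && df$wanted[prev] == 0) df$wanted[i] <- 0
  }
  df
}

res <- apply_rule_loop(the.df)
res$wanted
# [1] 1 0 0 0 1 0 0 0 1 1 0 0
```

The lookup with `which()` scans the whole data frame for every row, so this is quadratic in the number of rows and only useful as a correctness check against a faster method.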

ceiling cat asked Aug 19 '16 09:08


1 Answer

Using data.table, I would try the following:

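library(data.table)
# Rows are processed in order of n within each x.
# indx holds the positions of zeros in val[2:.N], so indx[1L] + 1L is the
# position of the first such zero in val: keep val up to there, zero the rest.
# With no such zero the `if` returns NULL and the group is left untouched.
setDT(the.df)[order(n), 
          val := if(length(indx <- which(val[2:.N] == 0L))) 
            c(val[1:(indx[1L] + 1L)], rep(0L, .N - (indx[1L] + 1L))), 
          by = x]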
the.df
#     x n val
#  1: a 0   1
#  2: a 1   0
#  3: a 2   0
#  4: a 3   0
#  5: b 0   1
#  6: b 1   0
#  7: b 2   0
#  8: b 3   0
#  9: c 0   1
# 10: c 1   1
# 11: c 2   0
# 12: c 3   0

This evaluates the rows in order of n (as you said, the real data is not sorted) and, at the same time, rebuilds val by group only where the condition holds (meaning that if the condition is not satisfied, val is left untouched).
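
The observation the code above relies on (once a zero appears at the second position or later in a group, everything after it becomes zero) can also be sketched in base R with a cumulative minimum. This is an illustration under assumptions, not a replacement for the data.table code: `zero_after_first_zero` is a hypothetical helper name, and the sketch assumes each group has consecutive n starting at 0 with no missing or duplicated rows.

```r
# Data typed in from the question's printed table
the.df <- data.frame(x = rep(letters[1:3], each = 4),
                     n = rep(0:3, 3),
                     val = c(1, 0, 0, 1,  1, 0, 1, 1,  1, 1, 0, 0))

# Hypothetical helper for one group that is already sorted by n.
# Assumption: n is consecutive from 0, so positions 1 and 2 are n = 0 and n = 1.
# Positions 1 and 2 are always kept; position i >= 3 is kept only while
# no zero has appeared in positions 2..(i - 1).
zero_after_first_zero <- function(v) {
  k <- length(v)
  if (k < 3) return(v)
  keep <- c(1, 1, cummin(v[2:(k - 1)] != 0))
  v * keep
}

ord <- order(the.df$x, the.df$n)            # sort once, by group and then by n
sorted <- the.df[ord, ]
sorted$wanted <- ave(sorted$val, sorted$x, FUN = zero_after_first_zero)

the.df$wanted <- numeric(nrow(the.df))
the.df$wanted[ord] <- sorted$wanted         # write back in the original row order
the.df$wanted
# [1] 1 0 0 0 1 0 0 0 1 1 0 0
```

If rows can be missing, as the question mentions, a gap in n would need to be handled separately, because this sketch only looks at positions within the sorted group and not at the n values themselves.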


Hopefully, this will be implemented in the near future, and then the code could potentially be:

setDT(the.df)[order(n), val[n > 2] := if(val[2L] == 0) 0L, by = x]

That could be a great improvement in both performance and syntax.

David Arenburg answered Oct 01 '22 13:10