I have a data frame made up of mostly sequential rows; "mostly" meaning that some rows are out of sequence or missing. When the next sequential row after the current row is present, I'd like to apply some function using data from both rows. If it's not present, I want to skip it and move on. I know I can do this with a loop, but it's quite slow. I think the solution has something to do with using the index. Here is an example of my problem using sample data, along with the desired result produced by a loop.
df <- data.frame(id=1:10, x=rnorm(10))
df <- df[c(1:3, 5:10), ]
df$z <- NA
dfLoop <- function(d)
{
  for(i in 1:(nrow(d)-1))
  {
    if(d[i+1, ]$id - d[i, ]$id == 1)
    {
      d[i, ]$z <- d[i+1, ]$x - d[i, ]$x
    }
  }
  return(d)
}
dfLoop(df)
So how might I be able to get the same result without using a loop? Thanks for any help.
Give this a try:
index <- which(diff(df$id)==1) #gives the index of rows that have a row below in sequence
df$z[index] <- diff(df$x)[index]
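To see what each step returns, here is a minimal sketch on a small deterministic data frame. The name `toy` and its values are made up for illustration; they are not from the question.

```r
# Toy data with a gap between id 3 and id 5 (values chosen for illustration)
toy <- data.frame(id = c(1, 2, 3, 5, 6), x = c(10, 12, 15, 20, 26))
toy$z <- NA

gaps <- diff(toy$id)        # 1 1 2 1: difference between each id and the next
index <- which(gaps == 1)   # 1 2 4: rows whose successor is the next id
toy$z[index] <- diff(toy$x)[index]
toy$z                       # 2 3 NA 6 NA
```

Row 3 keeps NA because the next id is 5, and the last row keeps NA because it has no successor.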
As a function:
fun <- function(x) {
  index <- which(diff(x$id)==1)
  xdiff <- diff(x$x)
  x$z[index] <- xdiff[index]
  return(x)
}
Compare with your loop:
a <- fun(df)
b <- dfLoop(df)
identical(a, b)
[1] TRUE
R is vector-based. Try this code, with d being your data frame (df in your example) -- it is just like your for loop but uses the entire range at once:
i <- 1:(nrow(d)-1)
d[i+1, ]$id - d[i, ]$id == 1
You should see a logical vector of length nrow(d) - 1 that is TRUE wherever the condition holds. Save it:
cond <- (d[i+1, ]$id - d[i, ]$id == 1)
You can also get the positions of all TRUE values:
(cond.pos <- which(cond))
Now you can assign values to those indexes where the condition is true:
d[cond.pos, ]$z <- d[cond.pos+1, ]$x - d[cond.pos, ]$x
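The steps above can be put together into one runnable sketch. A few things here are assumptions rather than part of the answer: the seed is arbitrary, `seq_len()` replaces `1:(nrow(d)-1)`, and columns are indexed as `d$id[i]` instead of `d[i, ]$id`, which should give the same values.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
d <- data.frame(id = 1:10, x = rnorm(10))
d <- d[c(1:3, 5:10), ]               # drop the row with id 4
d$z <- NA

i <- seq_len(nrow(d) - 1)            # avoids the 1:0 trap for a one-row data frame
cond <- d$id[i + 1] - d$id[i] == 1   # logical vector, one entry per adjacent pair
cond.pos <- which(cond)              # row positions whose next row is in sequence
d$z[cond.pos] <- d$x[cond.pos + 1] - d$x[cond.pos]
d
```

Rows 3 and 9 keep NA in z: the first is followed by id 5, and the second is the last row.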
There are quite a few ways to achieve what you want, but it takes some experience to grasp the "vector-based" idea. The diff function in particular, as noted by alexwhan, can save some typing in this specific example.
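As one of those other ways, here is a sketch that looks up each row's successor by id with `match()`. Neither answer proposes this, and the name `df2` and the seed are assumptions for illustration. Note that it behaves differently from the loop when rows are out of order: it pairs a row with id + 1 wherever that row sits, not only when it is directly below.

```r
set.seed(1)                        # arbitrary seed, only for reproducibility
df2 <- data.frame(id = 1:10, x = rnorm(10))
df2 <- df2[c(1:3, 5:10), ]         # drop the row with id 4

nxt <- match(df2$id + 1, df2$id)   # position of the row holding id + 1, NA if absent
df2$z <- df2$x[nxt] - df2$x        # indexing with NA yields NA, so gaps stay NA
df2
```

For data that is already sorted by id, as in the question, this should give the same z column as the diff version.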