I have a data frame:
id <- c(rep(1, 4), rep(2, 3), rep(3, 2), 4)
rate <- c(rep(1, 3), NA, 0.5, 0.6, NA, 0.7, NA, NA)
df <- data.frame(id, rate)
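Printing df shows where the NAs fall:
   id rate
1   1  1.0
2   1  1.0
3   1  1.0
4   1   NA
5   2  0.5
6   2  0.6
7   2   NA
8   3  0.7
9   3   NA
10  4   NA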
I need to replace the NAs based on the following conditions:
for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- round(mean(df$rate[df$id == df$id[i]], na.rm = TRUE), 1)
    if (is.nan(mrate)) {
      df$rate[i] <- 1
    } else {
      df$rate[i] <- mrate
    }
  }
}
Apparently the for loop is simply too slow on a big data frame with >200K rows. How can I do this much faster without using a for loop?
Thanks!
Here is a solution using data.table:
library(data.table)
dt <- data.table( df, key = "id" )
dt[ , rate := ifelse( is.na(rate), round( mean(rate, na.rm=TRUE), 1), rate ), by = id ]
dt[ is.na(rate), rate := 1 ]
dt
    id rate
 1:  1  1.0
 2:  1  1.0
 3:  1  1.0
 4:  1  1.0
 5:  2  0.5
 6:  2  0.6
 7:  2  0.6
 8:  3  0.7
 9:  3  0.7
10:  4  1.0
I am not sure, though, whether the ifelse could or should be avoided.
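One way to avoid it is to fill the NAs by reference via a temporary group-mean column. This is just a sketch of the idea, starting again from a fresh dt built from df; the helper column name grp_mean is mine, and I have not verified that it is faster:
dt <- data.table(df, key = "id")                               ## fresh copy
dt[, grp_mean := round(mean(rate, na.rm = TRUE), 1), by = id]  ## per-group mean
dt[is.na(rate), rate := grp_mean]   ## fill NAs; all-NA groups produce NaN
dt[is.na(rate), rate := 1]          ## NaN counts as NA, so this fixes those groups
dt[, grp_mean := NULL]              ## drop the helper column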
As mentioned in my comment, for loops in R are not especially slow. However, a for loop often indicates other inefficiencies in the code. In this case, the subset operation that is repeated for each row to determine the mean is most likely the slowest bit of code.
for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- round(mean(df$rate[df$id == df$id[i]], na.rm = TRUE), 1) ## This line!
    if (is.nan(mrate)) {
      df$rate[i] <- 1
    } else {
      df$rate[i] <- mrate
    }
  }
}
If, instead, these group averages are determined beforehand, the loop can do a rapid lookup.
foo <- aggregate(df$rate, list(df$id), mean, na.rm = TRUE)
for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- foo$x[foo$Group.1 == df$id[i]]
    ...
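For completeness, the rest of that loop would mirror the original's NaN handling; this is my completion of the elided part, not the author's code:
for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- foo$x[foo$Group.1 == df$id[i]]
    if (is.nan(mrate)) {   ## aggregate() gives NaN for all-NA groups
      df$rate[i] <- 1
    } else {
      df$rate[i] <- mrate
    }
  }
}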
However, this still subsets the large data.frame at df$id[i] for every row. Instead, using one of the tools that implements a split-apply-combine strategy is a good idea. Also, let's write a function that takes one group's rows and its pre-computed group average and does the right thing:
myfun <- function(DF) {
  avg <- avgs$rate[avgs$id == unique(DF$id)]
  if (is.nan(avg)) {
    avg <- 1
  }
  DF$rate[is.na(DF$rate)] <- avg
  return(DF)
}
The plyr version:
library(plyr)
avgs <- ddply(df, .(id), summarise, rate=mean(rate, na.rm=TRUE))
result <- ddply(df, .(id), myfun)
And the likely much faster data.table version:
library(data.table)
DT <- data.table(df)
setkey(DT, id)
DT[, avg := mean(rate, na.rm=TRUE), by=id]
DT[is.nan(avg), avg := 1]
DT[, rate := ifelse(is.na(rate), avg, rate)]
This way, we've avoided all the lookup subsetting in favor of adding a pre-calculated column, and can now do row-wise assignments, which are fast and efficient. The extra column can be dropped inexpensively using:
DT[, avg := NULL]
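If you want to check the "likely much faster" claim at the scale mentioned in the question, a rough timing sketch might look like this (big is my own test object built by resampling df; exact numbers will vary):
big <- df[sample(nrow(df), 2e5, replace = TRUE), ]             ## ~200K-row copy
avgs <- ddply(big, .(id), summarise, rate = mean(rate, na.rm = TRUE))
system.time(result <- ddply(big, .(id), myfun))                ## plyr version
system.time({                                                  ## data.table version
  DT <- data.table(big)
  setkey(DT, id)
  DT[, avg := mean(rate, na.rm = TRUE), by = id]
  DT[is.nan(avg), avg := 1]
  DT[, rate := ifelse(is.na(rate), avg, rate)]
})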
The whole shebang could be wrapped into a function or a single data.table expression. But, IMO, that often comes at the expense of clarity!
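For what it's worth, a minimal sketch of such a wrapper, just packaging the steps above (the name impute_rate is mine, not an established function):
impute_rate <- function(df) {
  DT <- data.table(df)
  setkey(DT, id)
  DT[, avg := mean(rate, na.rm = TRUE), by = id]  ## pre-compute group means
  DT[is.nan(avg), avg := 1]                       ## all-NA groups default to 1
  DT[, rate := ifelse(is.na(rate), avg, rate)]    ## fill the NAs
  DT[, avg := NULL]                               ## drop the helper column
  DT[]
}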