Replace NA with mean matching the same ID

Tags:

r

I have a data frame:

id <- c(rep(1, 4), rep(2, 3), rep(3, 2), 4)
rate <- c(rep(1, 3), NA, 0.5, 0.6, NA, 0.7, NA, NA)
df <- data.frame(id, rate)

and I need to replace the NAs with the mean of the same id's group (or 1 if the whole group is NA), as in the following loop:

for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- round(mean(df$rate[df$id == df$id[i]], na.rm = T), 1)
    if (is.nan(mrate)) {
      df$rate[i] <- 1
    } else {
      df$rate[i] <- mrate
    }
  }
}

Apparently the for loop is simply too slow on a big data frame with >200K rows. How can I do this in a much faster way, without the for loop?

Thanks!

asked Dec 15 '22 by Rock

2 Answers

This is a solution using data.table:

library(data.table)
dt <- data.table(df, key = "id")
## per id, replace each NA with the group mean, rounded to one decimal
dt[, rate := ifelse(is.na(rate), round(mean(rate, na.rm = TRUE), 1), rate), by = id]
## groups that were entirely NA still have no value (NaN); default those to 1
dt[is.na(rate), rate := 1]
dt
    id rate
 1:  1  1.0
 2:  1  1.0
 3:  1  1.0
 4:  1  1.0
 5:  2  0.5
 6:  2  0.6
 7:  2  0.6
 8:  3  0.7
 9:  3  0.7
10:  4  1.0

I am not sure, though, whether the ifelse could or should be avoided.
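
One way it can be avoided (a sketch of my own, not from the answer above; grp_mean is just a scratch column name) is to assign only into the NA rows:

dt[, grp_mean := round(mean(rate, na.rm = TRUE), 1), by = id]
dt[is.na(rate), rate := grp_mean]  # all-NA groups get NaN here...
dt[is.na(rate), rate := 1]         # ...which this line then defaults to 1
dt[, grp_mean := NULL]             # drop the scratch column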

answered Dec 22 '22 by Beasterfield

As mentioned in my comment, for loops in R are not inherently slow. However, a for loop often points to other inefficiencies in the code. In this case, the subset operation that is repeated for every row to compute the mean is most likely the slowest bit of code.

for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- round(mean(df$rate[df$id == df$id[i]], na.rm = T), 1)  ## This line!
    if (is.nan(mrate)) {
      df$rate[i] <- 1
    } else {
      df$rate[i] <- mrate
    }
  }
}
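
To see why that line dominates, here is a rough timing sketch (my own; the data sizes are made up): each df$id == df$id[i] comparison scans the entire id column, so the loop does on the order of n^2 work.

set.seed(1)
n <- 2e4
big <- data.frame(id = sample(1:2000, n, replace = TRUE), rate = runif(n))
big$rate[sample(n, n / 5)] <- NA   # ~20% missing
system.time(
  for (i in 1:n) {
    if (is.na(big$rate[i])) {
      mean(big$rate[big$id == big$id[i]], na.rm = TRUE)  # the repeated scan
    }
  }
)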

If instead these group averages are computed beforehand, the loop only needs to do a quick lookup.

foo <- aggregate(df$rate, list(df$id), mean, na.rm=TRUE)
for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- foo$x[foo$Group.1 == df$id[i]]
...
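
Completed (my own filling-in; the original snippet stops at the lookup), the loop reusing foo would read:

for (i in 1:nrow(df)) {
  if (is.na(df$rate[i])) {
    mrate <- round(foo$x[foo$Group.1 == df$id[i]], 1)
    df$rate[i] <- if (is.nan(mrate)) 1 else mrate
  }
}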

However, I am still doing a subset at df$id[i] on the large data.frame. Instead, it is a good idea to use one of the tools that implement a split-apply-combine strategy. Also, let's write a function that takes one id's chunk of the data frame, looks up its pre-computed group average, and does the right thing:

myfun <- function(DF) {
  ## look up the pre-computed average for this id
  avg <- avgs$rate[avgs$id == unique(DF$id)]
  ## an all-NA group has a NaN average; default it to 1
  if (is.nan(avg)) {
    avg <- 1
  }
  DF$rate[is.na(DF$rate)] <- avg

  return(DF)
}

The plyr version:

library(plyr)
## pre-compute the per-id averages, then apply myfun to each id chunk
avgs <- ddply(df, .(id), summarise, rate = mean(rate, na.rm = TRUE))
result <- ddply(df, .(id), myfun)

And the likely much faster data.table version:

library(data.table)
DT <- data.table(df)
setkey(DT, id)

## pre-compute the group average as an extra column
DT[, avg := mean(rate, na.rm = TRUE), by = id]
## all-NA groups produce NaN; default those to 1
DT[is.nan(avg), avg := 1]

## fill each NA rate from the pre-computed column
DT[, rate := ifelse(is.na(rate), avg, rate)]

This way, we've avoided all of the lookup subsetting in favor of adding a pre-calculated column, and the row-wise operations are now fast and efficient. The extra column can be dropped inexpensively using:

DT[, avg := NULL]

The whole shebang can be written into a function or a data.table expression. But, IMO, that often comes at the expense of clarity!
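
For illustration only, here is a minimal sketch of such a wrapper (my own; the function name is arbitrary, and like the data.table code above it does not round the group mean the way the question's loop did):

library(data.table)

impute_rate <- function(df) {
  DT <- data.table(df)
  DT[, avg := mean(rate, na.rm = TRUE), by = id]  # pre-compute group averages
  DT[is.nan(avg), avg := 1]                       # all-NA groups default to 1
  DT[, rate := ifelse(is.na(rate), avg, rate)]    # fill the NAs from avg
  DT[, avg := NULL]                               # drop the helper column
  DT[]                                            # return the result
}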

answered Dec 21 '22 by Justin