I have a data frame:
id <- c(rep(1, 4), rep(2, 3), rep(3, 2), 4)
rate <- c(rep(1, 3), NA, 0.5, 0.6, NA, 0.7, NA, NA)
df <- data.frame(id, rate)
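Printing df shows where the NAs fall:
   id rate
1   1  1.0
2   1  1.0
3   1  1.0
4   1   NA
5   2  0.5
6   2  0.6
7   2   NA
8   3  0.7
9   3   NA
10  4   NA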
I need to replace the NAs based on the following conditions:
for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- round(mean(df$rate[df$id == df$id[i]], na.rm = TRUE), 1)
    if (is.nan(mrate)) {
      df$rate[i] <- 1
    } else {
      df$rate[i] <- mrate
    }
  }
}
Apparently the for loop is simply too slow on a big data frame with >200K rows. How can I do this much faster without using a for loop?
Thanks!
Here is a solution using data.table:
library(data.table)
dt <- data.table( df, key = "id" )
dt[ , rate := ifelse( is.na(rate), round( mean(rate, na.rm=TRUE), 1), rate ), by = id ]
dt[ is.na(rate), rate := 1 ]
dt
    id rate
 1:  1  1.0
 2:  1  1.0
 3:  1  1.0
 4:  1  1.0
 5:  2  0.5
 6:  2  0.6
 7:  2  0.6
 8:  3  0.7
 9:  3  0.7
10:  4  1.0
I am not sure, though, whether the ifelse could or should be avoided.
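One way to avoid it is to fill the NAs by reference via a temporary group-mean column. This is just a sketch of the idea, starting again from a fresh dt built from df; the helper column name grp_mean is mine, and I have not verified that it is faster:
dt <- data.table(df, key = "id")                               ## fresh copy
dt[, grp_mean := round(mean(rate, na.rm = TRUE), 1), by = id]  ## per-group mean
dt[is.na(rate), rate := grp_mean]   ## fill NAs; all-NA groups produce NaN
dt[is.na(rate), rate := 1]          ## NaN counts as NA, so this fixes those groups
dt[, grp_mean := NULL]              ## drop the helper column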
As mentioned in my comment, for loops in R are not especially slow. However, a for loop often indicates other inefficiencies in the code. In this case, the subset operation that is repeated for each row to determine the mean is most likely the slowest bit of code.
for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- round(mean(df$rate[df$id == df$id[i]], na.rm = TRUE), 1) ## This line!
    if (is.nan(mrate)) {
      df$rate[i] <- 1
    } else {
      df$rate[i] <- mrate
    }
  }
}
If, instead, these group averages are determined beforehand, the loop can do a rapid lookup.
foo <- aggregate(df$rate, list(df$id), mean, na.rm = TRUE)
for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- foo$x[foo$Group.1 == df$id[i]]
    ...
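For completeness, the rest of that loop would mirror the original's NaN handling; this is my completion of the elided part, not the author's code:
for (i in 1:dim(df)[1]) {
  if (is.na(df$rate[i])) {
    mrate <- foo$x[foo$Group.1 == df$id[i]]
    if (is.nan(mrate)) {   ## aggregate() gives NaN for all-NA groups
      df$rate[i] <- 1
    } else {
      df$rate[i] <- mrate
    }
  }
}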
However, this still subsets the large data.frame at df$id[i] for every row. Instead, using one of the tools that implements a split-apply-combine strategy is a good idea. Also, let's write a function that takes one group's rows and its pre-computed group average and does the right thing:
myfun <- function(DF) {
  avg <- avgs$rate[avgs$id == unique(DF$id)]
  if (is.nan(avg)) {
    avg <- 1
  }
  DF$rate[is.na(DF$rate)] <- avg
  return(DF)
}
The plyr version:
library(plyr)
avgs <- ddply(df, .(id), summarise, rate=mean(rate, na.rm=TRUE))
result <- ddply(df, .(id), myfun)
And the likely much faster data.table version:
library(data.table)
DT <- data.table(df)
setkey(DT, id)
DT[, avg := mean(rate, na.rm=TRUE), by=id]
DT[is.nan(avg), avg := 1]
DT[, rate := ifelse(is.na(rate), avg, rate)]
This way, we've avoided all the lookup subsetting in favor of adding a pre-calculated column, and can now do row-wise assignments, which are fast and efficient. The extra column can be dropped inexpensively using:
DT[, avg := NULL]
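If you want to check the "likely much faster" claim at the scale mentioned in the question, a rough timing sketch might look like this (big is my own test object built by resampling df; exact numbers will vary):
big <- df[sample(nrow(df), 2e5, replace = TRUE), ]             ## ~200K-row copy
avgs <- ddply(big, .(id), summarise, rate = mean(rate, na.rm = TRUE))
system.time(result <- ddply(big, .(id), myfun))                ## plyr version
system.time({                                                  ## data.table version
  DT <- data.table(big)
  setkey(DT, id)
  DT[, avg := mean(rate, na.rm = TRUE), by = id]
  DT[is.nan(avg), avg := 1]
  DT[, rate := ifelse(is.na(rate), avg, rate)]
})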
The whole shebang could be wrapped into a function or a single data.table expression. But, IMO, that often comes at the expense of clarity!
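For what it's worth, a minimal sketch of such a wrapper, just packaging the steps above (the name impute_rate is mine, not an established function):
impute_rate <- function(df) {
  DT <- data.table(df)
  setkey(DT, id)
  DT[, avg := mean(rate, na.rm = TRUE), by = id]  ## pre-compute group means
  DT[is.nan(avg), avg := 1]                       ## all-NA groups default to 1
  DT[, rate := ifelse(is.na(rate), avg, rate)]    ## fill the NAs
  DT[, avg := NULL]                               ## drop the helper column
  DT[]
}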