How to fill NA with median?

Question

Example data:

set.seed(1)
df <- data.frame(years=sort(rep(2005:2010, 12)), 
                 months=1:12, 
                 value=c(rnorm(60),NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))

head(df)
  years months      value
1  2005      1 -0.6264538
2  2005      2  0.1836433
3  2005      3 -0.8356286
4  2005      4  1.5952808
5  2005      5  0.3295078
6  2005      6 -0.8204684

Tell me please, how i can replace NA in df$value to median of others months? "value" must contain the median of value of all previous values for the same month. That is, if current month is May, "value" must contain the median value for all previous values of the month of May.

Justin · Accepted Answer

you want to use the test is.na function:

df$value[is.na(df$value)] <- median(df$value, na.rm=TRUE)

which says for all the values where df$value is NA, replace it with the right hand side. You need the na.rm=TRUE piece or else the median function will return NA

to do this month by month, there are many choices, but i think plyr has the simplest syntax:

library(plyr)
ddply(df, 
      .(months), 
      transform, 
      value=ifelse(is.na(value), median(value, na.rm=TRUE), value))

you can also use data.table. this is an especially good choice if your data is large:

library(data.table)
DT <- data.table(df)
setkey(DT, months)

DT[,value := ifelse(is.na(value), median(value, na.rm=TRUE), value), by=months]

There are many other ways, but there are two!

Luciano Selzer · Answer

Or with ave

df <- data.frame(years=sort(rep(2005:2010, 12)),
months=1:12,
value=c(rnorm(60),NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
df$value[is.na(df$value)] <- with(df, ave(value, months, 
   FUN = function(x) median(x, na.rm = TRUE)))[is.na(df$value)]

Since there are so many answers let's see which is fastest.

plyr2 <- function(df){
  medDF <- ddply(df,.(months),summarize,median=median(value,na.rm=TRUE))
df$value[is.na(df$value)] <- medDF$median[match(df$months,medDF$months)][is.na(df$value)]
  df
}
library(plyr)
library(data.table)
DT <- data.table(df)
setkey(DT, months)


benchmark(ave = df$value[is.na(df$value)] <- 
  with(df, ave(value, months, 
               FUN = function(x) median(x, na.rm = TRUE)))[is.na(df$value)],
          tapply = df$value[61:72] <- 
            with(df, tapply(value, months, median, na.rm=TRUE)),
          sapply = df[61:72, 3] <- sapply(split(df[1:60, 3], df[1:60, 2]), median),
          plyr = ddply(df, .(months), transform, 
                       value=ifelse(is.na(value), median(value, na.rm=TRUE), value)),
          plyr2 = plyr2(df),
          data.table = DT[,value := ifelse(is.na(value), median(value, na.rm=TRUE), value), by=months],
          order = "elapsed")
        test replications elapsed relative user.self sys.self user.child sys.child
3     sapply          100   0.209 1.000000     0.196    0.000          0         0
1        ave          100   0.260 1.244019     0.244    0.000          0         0
6 data.table          100   0.271 1.296651     0.264    0.000          0         0
2     tapply          100   0.271 1.296651     0.256    0.000          0         0
5      plyr2          100   1.675 8.014354     1.612    0.004          0         0
4       plyr          100   2.075 9.928230     2.004    0.000          0         0

I would have bet that data.table was the fastest.

[ Matthew Dowle ] The task being timed here takes at most 0.02 seconds (2.075/100). data.table considers that insignificant. Try setting replications to 1 and increasing the data size, instead. Or timing the fastest of 3 runs is also a common rule of thumb. More verbose discussion in these links :

Evidence that data.table isn't always fastest
Benchmarks in Averaging column values for specific sections of data corresponding to other column values
London R presentation, June 2012 (slide 21 headed "Other")
A transform by group benchmark in an extreme case

How to fill NA with median?

Tags:

r

data.table

statistics

plyr

Sheridan

2 Answers

Justin

Luciano Selzer

Recent Activity

Donate For Us

How to fill NA with median?

Tags:

r

data.table

statistics

plyr

Sheridan

2 Answers

Justin

Luciano Selzer

Related questions

Recent Activity

Donate For Us