How to impute values in a data.table by groups?

Question

Take the following data table:

# IMPUTING VALUES
library(data.table)
set.seed(1337)
mydt <- data.table(Year = rep(2000:2005, each = 10),
             Type = c("A","B"),
             Value = 30 + rnorm(60)
             )
naRows <- sample(nrow(mydt),15)
mydt[ naRows, Value := NA]
setkey(mydt,Year,Type)

How would I go about imputing the NAs with the median by Year and Type? I have tried the following

# computed medians
computedMedians <- mydt[, .(Median = median(Value, na.rm = TRUE)), keyby = .(Year,Type)]
# dataset of just NA rows
dtNAs <- mydt[ is.na(Value), .SD, by = .(Year,Type)]


mydt[ is.na(Value),
      Imputations := dtNAs[computedMedians, nomatch = 0][, Median], 
      by = .(Year,Type)]
mydt

but when you run the code, you'll see that it works unless a group is missing data completely, and the computed medians get recycled. Is there a simpler way? or how would you go about getting just the last error fixed?

Arun · Accepted Answer

If you prefer updating the rows without copying the entire column, then:

require(data.table) # v1.9.6+
cols = c("Year", "Type")
dt[is.na(Value), Value := dt[.BY, median(Value, na.rm=TRUE), on=cols], by=c(cols)]

.BY is a special symbol which is a named list containing the groups. Although this requires a join with the entire data.table every time, it should be quite fast, as it's searching for only one group.

Frank · Answer

There's no need to make a secondary table; it can be done inside a single by-group call:

mydt[, 
  Value := replace(Value, is.na(Value), median(Value, na.rm=TRUE))
, by=.(Year,Type)]

This imputation doesn't guarantee that all missing values are filled (e.g., 2005-B is still NA).

How to impute values in a data.table by groups?

Tags:

r

data.table

jks612

2 Answers

Arun

Frank

Recent Activity

Donate For Us

How to impute values in a data.table by groups?

Tags:

r

data.table

jks612

2 Answers

Arun

Frank

Related questions

Recent Activity

Donate For Us