
Replacing NAs iteratively using data.table in R

Tags: r, data.table

I'm trying to replace NAs with a random sample from the matching group. For example, in row 2 of the output below the NA belongs to 'France' with Age '20-30' and Time '30-40'. Hence I want to draw a random value from the Response column of all other 'France', '20-30', '30-40' observations.

The code below runs, but every NA within a group is replaced with the same random draw. For example, if I had more than one 'France', '20-30', '30-40' NA, their corresponding R2 values would all be the same.

I would like each NA to be sampled independently, but data.table seems to do it 'all at once' for each group, so I can't do that. Any ideas?

DT <- data.table(mydf, key = "Country,Age,Time")
DT[, R2 := ifelse(is.na(Response), sample(na.omit(Response), 1), 
                  Response), by = key(DT)]
DT
#    Index Country   Age  Time Response R2
# 1:     5  France 20-30 30-40        1  1
# 2:     6  France 20-30 30-40       NA  2
# 3:     7  France 20-30 30-40        2  2
# 4:     1 Germany 20-30 15-20        1  1
# 5:     2 Germany 20-30 15-20       NA  1
# 6:     3 Germany 20-30 15-20        1  1
# 7:     4 Germany 20-30 15-20        0  0

where mydf is

mydf <- structure(list(Index = 1:7, Country = c("Germany", "Germany", 
"Germany", "Germany", "France", "France", "France"), Age = c("20-30", 
"20-30", "20-30", "20-30", "20-30", "20-30", "20-30"), Time = c("15-20", 
"15-20", "15-20", "15-20", "30-40", "30-40", "30-40"), Response = c(1L, 
NA, 1L, 0L, 1L, NA, 2L)), .Names = c("Index", "Country", "Age", 
"Time", "Response"), class = "data.frame", row.names = c(NA, -7L))
user3154267 asked Jun 28 '26 18:06


2 Answers

I'd do it this way:

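# Flag the rows that need a replacement.
DT[, is_na := is.na(Response)]
# Per group, draw one value (with replacement) for each NA in that group.
nas <- DT[, sample(Response[!is_na], sum(is_na), replace = TRUE),
             by = list(Country, Age, Time)]$V1
# Copy Response, then overwrite the NA rows (DT is keyed, so the draws line up by group).
DT[, R2 := Response][(is_na), R2 := nas]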
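The original ifelse() call repeats one value because sample(..., 1) returns a single draw per group, which ifelse() then recycles over every NA in that group. A possible variant that draws one value per NA inside a single grouped step is sketched below. The helper name fill_na_by_sampling and the toy table are hypothetical, not part of the answer above. The sketch indexes the pool with sample.int() because sample(x, ...) treats a single number x >= 1 as 1:x, so a group with one observed value such as 5 would otherwise get draws from 1 to 5. Groups with no observed values are simply left as NA.

```r
library(data.table)

# Hypothetical helper (an illustration, not from the answer above): fill the
# NAs of one group's vector with independent draws from its observed values.
fill_na_by_sampling <- function(x) {
  na <- is.na(x)
  pool <- x[!na]
  # Nothing to fill, or nothing to draw from: return the vector unchanged.
  if (!any(na) || length(pool) == 0L) return(x)
  # Draw positions rather than values, so a lone observed value such as 5
  # is not expanded to 1:5 the way sample(5, ...) would expand it.
  x[na] <- pool[sample.int(length(pool), sum(na), replace = TRUE)]
  x
}

# Toy data (made-up values) with a single-value pool in each group.
toy <- data.table(
  Country  = c("France", "France", "France", "Germany", "Germany"),
  Response = c(5L, NA, NA, NA, 3L)
)
toy[, R2 := fill_na_by_sampling(Response), by = Country]
```

Because the fill happens inside each by group, this version does not depend on the table being sorted by the grouping columns.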
Arun answered Jul 02 '26 05:07


set.seed(1234)
library(data.table)
DT <- data.table(mydf, key = "Country,Age,Time")

First step

# Draw a candidate value for every row from its group's observed responses.
DT[, R2 := sample(na.omit(Response), length(Response), replace = TRUE), 
   by = key(DT)]

DT

#    Index Country   Age  Time Response R2
# 1:     5  France 20-30 30-40        1  1
# 2:     6  France 20-30 30-40       NA  2
# 3:     7  France 20-30 30-40        2  2
# 4:     1 Germany 20-30 15-20        1  1
# 5:     2 Germany 20-30 15-20       NA  0
# 6:     3 Germany 20-30 15-20        1  1
# 7:     4 Germany 20-30 15-20        0  1

EDIT

Second step

In the first step, you sample within each group (by = ...) and get a value of R2 for every row. The second step overwrites R2 with the original Response value wherever Response is not NA.

DT[!is.na(Response), R2 := Response]

DT

#    Index Country   Age  Time Response R2
# 1:     5  France 20-30 30-40        1  1
# 2:     6  France 20-30 30-40       NA  2
# 3:     7  France 20-30 30-40        2  2
# 4:     1 Germany 20-30 15-20        1  1
# 5:     2 Germany 20-30 15-20       NA  0
# 6:     3 Germany 20-30 15-20        1  1
# 7:     4 Germany 20-30 15-20        0  0
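The result can be sanity-checked after the fact. The sketch below rebuilds the question's data in a compact form, reruns the two steps above, and then tests two properties: observed responses are unchanged, and every imputed value occurs among the observed values of its own group. The names kept and in_pool are illustrative only.

```r
library(data.table)
set.seed(1234)

# Same values as the question's mydf, written compactly.
DT <- data.table(
  Country  = rep(c("Germany", "France"), times = c(4, 3)),
  Age      = "20-30",
  Time     = rep(c("15-20", "30-40"), times = c(4, 3)),
  Response = c(1L, NA, 1L, 0L, 1L, NA, 2L)
)
setkey(DT, Country, Age, Time)

# The two steps from this answer.
DT[, R2 := sample(na.omit(Response), length(Response), replace = TRUE),
   by = key(DT)]
DT[!is.na(Response), R2 := Response]

# Check 1: observed responses are carried over unchanged.
kept <- DT[!is.na(Response), all(R2 == Response)]

# Check 2: each imputed value occurs among its own group's observed values.
in_pool <- DT[, .(ok = all(R2[is.na(Response)] %in% Response[!is.na(Response)])),
              by = key(DT)]
```

If kept is TRUE and every entry of in_pool$ok is TRUE, the imputation respected the grouping. The checks do not prove the draws are independent, only that each one came from the right pool.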
marbel answered Jul 02 '26 03:07


