
Replacing all missing values in R data.table with a value

If you have an R data.table that has missing values, how do you replace all of them with, say, the value 0? For example:

aa = data.table(V1=1:10,V2=c(1,2,2,3,3,3,4,4,4,4))
bb = data.table(V1=3:6,X=letters[1:4])
setkey(aa,V1)
setkey(bb,V1)
tt = bb[aa]

    V1  X V2
 1:  1 NA  1
 2:  2 NA  2
 3:  3  a  2
 4:  4  b  3
 5:  5  c  3
 6:  6  d  3
 7:  7 NA  4
 8:  8 NA  4
 9:  9 NA  4
10: 10 NA  4

Is there a way to do this in one line? If it were just a matrix, you could do:

tt[is.na(tt)] = 0 
FBC asked Dec 12 '13 05:12


2 Answers

is.na (being a primitive) has very little overhead and is usually quite fast. So you can just loop through the columns and use set to replace NA with 0.

Using <- to assign will result in a copy of all the columns, and it is not the idiomatic data.table way.

First I'll show how to do it, and then how slow the <- approach can get on huge data (due to the copy):

One way to do this efficiently:

for (i in seq_along(tt)) set(tt, i=which(is.na(tt[[i]])), j=i, value=0) 

You'll get a warning here that 0 is being coerced to character to match the type of the column (X is a character column). You can ignore it.
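If you would rather avoid the coercion warning than ignore it, one option is to pick a fill value that matches each column's type. The following is only a sketch: fill_for is a hypothetical helper introduced here for illustration (it is not part of data.table), and it only handles character, integer and double columns.

```r
library(data.table)

# Rebuild the example table from the question
aa <- data.table(V1 = 1:10, V2 = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4))
bb <- data.table(V1 = 3:6, X = letters[1:4])
setkey(aa, V1)
setkey(bb, V1)
tt <- bb[aa]

# Hypothetical helper (an assumption, not a data.table function):
# return a "zero" of the same type as the column.
# Only character, integer and double columns are handled here.
fill_for <- function(col) {
  if (is.character(col)) "0" else if (is.integer(col)) 0L else 0
}

# Same loop as above, but the replacement value matches the column type,
# so set() has nothing to coerce
for (j in seq_along(tt)) {
  set(tt, i = which(is.na(tt[[j]])), j = j, value = fill_for(tt[[j]]))
}

print(tt)
```

Columns of other types (factor, Date, logical and so on) would need their own branch in the helper.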

Why shouldn't you use <- here:

# by reference - idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# modifies value by reference - no copy
system.time({
for (i in seq_along(tt))
    set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
})
#   user  system elapsed
#  0.284   0.083   0.386

# by copy - NOT the idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# makes copy
system.time({tt[is.na(tt)] <- 0})
# a bunch of "tracemem" output showing the copies being made
#   user  system elapsed
#  4.110   0.976   5.187
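On a small table, a rough way to see the same difference without tracemem is to compare memory addresses before and after each kind of update, using data.table's address() function. This is a sketch on a made-up three-row table (dt and dt2 are names chosen for illustration), and the exact addresses printed will differ from run to run.

```r
library(data.table)

dt <- data.table(a = c(1, NA, 3), b = c(NA, 2, NA))

# Addresses of the table object and of column a before the update
obj_before <- address(dt)
col_before <- address(dt[["a"]])

# Update by reference with set()
for (j in seq_along(dt)) {
  set(dt, i = which(is.na(dt[[j]])), j = j, value = 0)
}

# TRUE here means the same object is still in place
same_obj <- identical(obj_before, address(dt))
same_col <- identical(col_before, address(dt[["a"]]))
print(c(same_obj = same_obj, same_col = same_col))

# For comparison: the `<-` form on a fresh table, printing the
# address before and after so the two can be compared by eye
dt2 <- data.table(a = c(1, NA, 3), b = c(NA, 2, NA))
print(address(dt2))
dt2[is.na(dt2)] <- 0
print(address(dt2))
```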
Arun answered Nov 03 '22 21:11


Nothing unusual here:

tt[is.na(tt)] = 0 

...will work.

This is somewhat confusing, however, given that:

tt[is.na(tt)] 

...currently returns:

Error in [.data.table(tt, is.na(tt)) : i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). Please let datatable-help know if you'd like this, or add your comments to FR #1611.
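To try both forms side by side, here is a small sketch on a made-up numeric table (dt is a name chosen for illustration). The read is wrapped in tryCatch so the script keeps going whatever your installed data.table version does with a matrix i; the message text may differ from the one quoted above.

```r
library(data.table)

dt <- data.table(a = c(1, NA, 3), b = c(NA, 2, NA))

# Reading with a logical matrix as i: capture the error message
# (if any) instead of stopping the script
res <- tryCatch(dt[is.na(dt)], error = function(e) conditionMessage(e))
print(res)

# Assigning with a logical matrix as i: this is the form that works
dt[is.na(dt)] <- 0
print(dt)
```

Keep in mind that this is the copying form discussed in the other answer, so it is best suited to small tables.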

thelatemail answered Nov 03 '22 19:11