
How do I impute missing variables in R using dplyr?

I would like to impute missing values for a variable given the existing values. In var2, we notice that there are a lot of NAs.

  1. If any 2 ids are the same, then their values for var2 are the same.
  2. If an id has no values for var2 at all, as with id==2, then the output should stay NA.

The result should go from df_old to df_new.

 df_old<- read.table(header = TRUE, text = "
 id  var1  var2 
  1  A       12    
  1  B       NA    
  1  E       NA    
  2  G       NA
  2  J       NA
 ")

df_new<- read.table(header = TRUE, text = "
id  var1  var2 
 1  A       12    
 1  B       12    
 1  E       12    
 2  G       NA
 2  J       NA
")

I tried:

df_new<-df_old %>%
        group_by(id) %>%
        mutate(var2=na.omit(var2))

I believe it doesn't work because of the second case. I was also wondering if using ifelse would be okay. Need help thanks!

HNSKD asked Jul 15 '16 07:07


3 Answers

If there is only one var2 value per id available you could simply do:

df_old %>%
  group_by(id) %>%
  mutate(var2 = min(var2, na.rm = TRUE))

Source: local data frame [5 x 3]
Groups: id [2]

     id   var1  var2
  <int> <fctr> <int>
1     1      A    12
2     1      B    12
3     1      E    12
4     2      G    NA
5     2      J    NA
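
On more recent dplyr releases, min(var2, na.rm = TRUE) on a group that is entirely NA may return Inf with a warning rather than NA, since that is what base R's min() does when nothing is left after dropping NAs. A minimal sketch that guards against this explicitly, assuming df_old is defined as in the question:

```r
library(dplyr)

# Same data as in the question
df_old <- read.table(header = TRUE, text = "
id  var1  var2
 1  A       12
 1  B       NA
 1  E       NA
 2  G       NA
 2  J       NA
")

df_new <- df_old %>%
  group_by(id) %>%
  # Keep NA for groups with no observed value; otherwise use the group minimum
  mutate(var2 = if (all(is.na(var2))) NA_integer_ else min(var2, na.rm = TRUE)) %>%
  ungroup()
```

The if/else is evaluated once per group, so the whole group gets either NA or the single observed value.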

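Another option, which assumes the non-missing value is in the first row of each group, would be:

df_old %>%
  group_by(id) %>%
  mutate(var2 = var2[1])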
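
If the non-missing value is not guaranteed to come first within its group, a minimal sketch like the following takes the first non-missing value wherever it appears (df_shuffled is a hypothetical reordering of the question's data, not something from the original post):

```r
library(dplyr)

# Hypothetical variant of the question's data: the observed value is in row 2
df_shuffled <- data.frame(
  id   = c(1L, 1L, 1L, 2L, 2L),
  var1 = c("B", "A", "E", "G", "J"),
  var2 = c(NA, 12L, NA, NA, NA)
)

df_filled <- df_shuffled %>%
  group_by(id) %>%
  # Drop the NAs, then take the first remaining element;
  # indexing an empty vector with [1] yields NA, which covers all-NA groups
  mutate(var2 = var2[!is.na(var2)][1]) %>%
  ungroup()
```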
erc answered Nov 14 '22 20:11


We can use data.table. Unlike the dplyr approach, for groups where var2 is all NA we have to return NA explicitly, or else min(var2, na.rm = TRUE) will give Inf.

library(data.table)
setDT(df_old)[, var2 := if(any(!is.na(var2))) min(var2, na.rm = TRUE) 
            else NA_integer_, by = id]
df_old    
#    id var1 var2
#1:  1    A   12
#2:  1    B   12
#3:  1    E   12
#4:  2    G   NA
#5:  2    J   NA
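
Note that setDT() converts df_old in place and := then modifies it by reference, so the original data is overwritten. If df_old should stay untouched and the result should land in a separate df_new, as in the question, a minimal sketch could work on a copy instead (as.data.table() is used here because it returns a new object):

```r
library(data.table)

# Same data as in the question
df_old <- read.table(header = TRUE, text = "
id  var1  var2
 1  A       12
 1  B       NA
 1  E       NA
 2  G       NA
 2  J       NA
")

# as.data.table() makes a copy, so df_old remains a plain data.frame
df_new <- as.data.table(df_old)

# Fill var2 per id; return NA explicitly when the group has no observed value
df_new[, var2 := if (any(!is.na(var2))) min(var2, na.rm = TRUE)
                 else NA_integer_, by = id]
```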
akrun answered Nov 14 '22 21:11


There is now a tidyimpute package available on CRAN which looks like it might do the trick:

"Functions and methods for imputing missing values (NA) in tables and list patterned after the tidyverse approach of 'dplyr' and 'rlang'; works with data.tables as well."

https://cran.r-project.org/web/packages/tidyimpute/tidyimpute.pdf

juhariis answered Nov 14 '22 20:11