Replace NAs with mean of the same column of a data.table

Tags:

r

data.table

I want to replace the NAs in a column of a data.table with the mean of that column. I tried the following, but it does not work.

ww <- data.table(iris)

ww <- ww[1:5 , ]

ww[1,1] <- NA

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:           NA         3.5          1.4         0.2  setosa
2:          4.9         3.0          1.4         0.2  setosa
3:          4.7         3.2          1.3         0.2  setosa
4:          4.6         3.1          1.5         0.2  setosa
5:          5.0         3.6          1.4         0.2  setosa


ww[is.na(Sepal.Length) , Sepal.Length:= mean(Sepal.Length, na.rm = T)]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:          NaN         3.5          1.4         0.2  setosa
2:          4.9         3.0          1.4         0.2  setosa
3:          4.7         3.2          1.3         0.2  setosa
4:          4.6         3.1          1.5         0.2  setosa
5:          5.0         3.6          1.4         0.2  setosa

Why am I getting NaN in place of NA when it should have been the mean of the rest of the values (4.9, 4.7, 4.6, 5.0)?

If something is wrong with this syntax, what is an alternative way of achieving this?

I want the data.table syntax.

user3664020 asked Sep 14 '15 11:09



2 Answers

na.aggregate in the zoo package replaces NAs with the mean of the non-NAs in the same column:

library(zoo)

ww[, Sepal.Length := na.aggregate(Sepal.Length)]
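
As for the NaN in the question: when i is supplied, data.table evaluates j only on the rows that i selects. Here that is the single NA row, so mean(..., na.rm = TRUE) is taken over zero values and returns NaN. A minimal sketch of that behaviour, together with one possible dependency-free fix using base R's replace() (chosen here for illustration, it does not come from the question; subset_mean is a hypothetical name):

```r
library(data.table)

ww <- data.table(iris[1:5, ])
ww[1, Sepal.Length := NA]

# j sees only the rows selected by i (the one NA row), so once
# na.rm = TRUE drops it the mean is computed over an empty vector.
subset_mean <- ww[is.na(Sepal.Length), mean(Sepal.Length, na.rm = TRUE)]
is.nan(subset_mean)

# Without i, j sees the whole column, so the mean uses all non-NA values.
ww[, Sepal.Length := replace(Sepal.Length, is.na(Sepal.Length),
                             mean(Sepal.Length, na.rm = TRUE))]
```

The second call still updates by reference; it just computes the mean before any rows are filtered out.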
G. Grothendieck answered Oct 05 '22 01:10


While the zoo answer is pretty nice, it requires a new dependency.
Using just data.table, you could do the following.

library(data.table)

# prepare data
ww = data.table(iris[1:5,])
ww[1, Sepal.Length := NA]

# solution
ww[, Sepal.Length.mean := mean(Sepal.Length, na.rm = TRUE) # calculate mean
   ][is.na(Sepal.Length), Sepal.Length := Sepal.Length.mean # replace NA with mean
     ][, Sepal.Length.mean := NULL # remove mean col
       ][] # just prints

While it may look bulky compared to the zoo version, it is efficient because every step updates by reference with :=. It can also easily be adapted to replace NA with the mean by group, just by using the by argument in data.table.
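
The by-group variant mentioned above might look roughly like the sketch below. The six-row sample, the positions of the NAs, and the helper column name grp.mean are assumptions made for illustration only.

```r
library(data.table)

# toy data: three rows each from two species, one NA per species
dt <- data.table(iris[c(1:3, 51:53), ])
dt[c(1, 4), Sepal.Length := NA]

# same three steps as above, but the mean is computed per Species
dt[, grp.mean := mean(Sepal.Length, na.rm = TRUE), by = Species  # group mean
   ][is.na(Sepal.Length), Sepal.Length := grp.mean               # fill NA
     ][, grp.mean := NULL                                        # drop helper col
       ][]                                                       # just prints
```

Each NA is filled with the mean of its own species rather than the mean of the whole column.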

jangorecki answered Oct 05 '22 02:10