
when to use na.omit versus complete.cases

Tags:

r

I have the following code comparing na.omit and complete.cases:

> mydf
  AA BB
1  2  2
2 NA  5
3  6  8
4  5 NA
5  9  6
6 NA  1
> 
> 
> na.omit(mydf)
  AA BB
1  2  2
3  6  8
5  9  6
> 
> mydf[complete.cases(mydf),]
  AA BB
1  2  2
3  6  8
5  9  6
> 
> str(na.omit(mydf))
'data.frame':   3 obs. of  2 variables:
 $ AA: int  2 6 9
 $ BB: int  2 8 6
 - attr(*, "na.action")=Class 'omit'  Named int [1:3] 2 4 6
  .. ..- attr(*, "names")= chr [1:3] "2" "4" "6"
> 
> 
> str(mydf[complete.cases(mydf),])
'data.frame':   3 obs. of  2 variables:
 $ AA: int  2 6 9
 $ BB: int  2 8 6
> 
> identical(na.omit(mydf), mydf[complete.cases(mydf),])
[1] FALSE

Are there situations where one should be preferred over the other, or are they effectively the same?

rnso asked Apr 06 '15 13:04


1 Answer

It is true that na.omit and complete.cases keep the same rows when complete.cases is applied to all columns of your object (e.g. a data.frame):

R> all.equal(na.omit(mydf), mydf[complete.cases(mydf),], check.attributes=FALSE)
[1] TRUE
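
The FALSE from identical() in the question appears to come only from that extra attribute. Here is a minimal sketch, rebuilding mydf from the values printed in the question (the names a and b are just for illustration):

```r
# Rebuild the example data from the question
mydf <- data.frame(AA = c(2L, NA, 6L, 5L, 9L, NA),
                   BB = c(2L, 5L, 8L, NA, 6L, 1L))

a <- na.omit(mydf)
b <- mydf[complete.cases(mydf), ]

# Which attributes does `a` carry that `b` does not?
print(setdiff(names(attributes(a)), names(attributes(b))))

# Drop the na.action attribute and compare the two objects again
attr(a, "na.action") <- NULL
print(identical(a, b))
```

If the attribute is the only difference, the second comparison should agree with the all.equal result above.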

But I see two fundamental differences between these two functions (there may well be others). First, na.omit adds an na.action attribute to the object, recording which rows were removed because of missing values. A trivial use case might look something like this:

foo <- function(data) {
  data <- na.omit(data)
  # na.action holds the indices of the dropped rows (NULL if none were dropped)
  n <- length(attr(data, "na.action"))
  message(sprintf("Note: %i rows removed due to missing values.", n))
  # do something with data
}
##
R> foo(mydf)
Note: 3 rows removed due to missing values.

where we give the user some relevant information. A more creative person could probably find (and likely already has found) better uses for the na.action attribute, but you get the point.
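
As one further possibility, the attribute can be used to look at the rows that were discarded. This is only a sketch: mydf is rebuilt from the question, and the names cleaned, dropped_idx and dropped are made up for illustration.

```r
# Rebuild the example data from the question
mydf <- data.frame(AA = c(2L, NA, 6L, 5L, 9L, NA),
                   BB = c(2L, 5L, 8L, NA, 6L, 1L))

cleaned <- na.omit(mydf)

# Positions of the dropped rows; NULL if nothing was dropped
dropped_idx <- attr(cleaned, "na.action")

# Pull the discarded rows back out of the original data
dropped <- mydf[as.integer(dropped_idx), ]
print(dropped)
```

Something like this could be handy for logging or auditing which observations were excluded before a model fit.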

Second, complete.cases lets you check for missing values in only a subset of the columns, e.g.

R> mydf[complete.cases(mydf[,1]),]
  AA BB
1  2  2
3  6  8
4  5 NA
5  9  6

Depending on what your variables represent, you may feel comfortable imputing values for column BB, but not for column AA, so using complete.cases like this allows you finer control.
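
A rough sketch of that workflow, assuming (purely for illustration) that missing BB values are filled with the mean of the observed BB values; the name kept and the choice of mean imputation are not part of the question:

```r
# Rebuild the example data from the question
mydf <- data.frame(AA = c(2L, NA, 6L, 5L, 9L, NA),
                   BB = c(2L, 5L, 8L, NA, 6L, 1L))

# Keep rows where AA is observed, regardless of BB
kept <- mydf[complete.cases(mydf["AA"]), ]

# Illustrative assumption: replace missing BB with the mean of observed BB.
# Note that this turns BB from integer into double.
kept$BB[is.na(kept$BB)] <- mean(kept$BB, na.rm = TRUE)
print(kept)
```

na.omit(mydf) would have dropped row 4 outright, whereas here it survives with an imputed BB.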

nrussell answered Oct 22 '22 16:10