
In R, is there a way to handle NA in an integer column of a data.frame so that NA values are not included when subsetting?

I have a data.frame like this:

(t=structure(list(count = c(NA, 2, NA, NA, NA, 8, NA, NA, NA)), .Names = "count", row.names = c(NA,-9L), class = "data.frame"))
  count
1    NA
2     2
3    NA
4    NA
5    NA
6     8
7    NA
8    NA
9    NA

It is great that R has the NA value but sometimes it bites me. I often forget about it and try to do subsetting like this

> t[t$count>=1,]
[1] NA  2 NA NA NA  8 NA NA NA

The output includes an NA for every row where count is NA, which I do not want.

After an hour of bug hunting I change the code to the following, which gives what I want (imagine a large data frame with lots of non-NA results and only a few "well-hidden" NAs):

> t[t$count>=1&!is.na(t$count),]
[1] 2 8

1. Is there a feature of the "as.integer" function so that I could do something like this:

t[as.integer.EXCLUDE.NA(t$count)>=1,]

I would want to use such a feature in other as.xxxx functions as well. Basically, I want to force R to stop thinking like a statistician and to treat NA differently, e.g. like NULL (I am not sure NULL would solve my issue; t$count[3]<-NULL did not work for some reason).

2. or how would I run

transform(t, replace all NAs from count columns with 0)

or even better

transform(t, replace all NA from all numeric columns with 0 in t)

3. Any general comments on making R forget about NAs are welcome.

userJT asked Mar 02 '12 17:03


2 Answers

I do not like the choices that were made when designing how "[" handles NA values either. The approach I take when I want to extract values using a logical test is to wrap the logical expression in which(). This converts the result to a set of integer positions, and indexing succeeds without dragging along the unwanted NAs:

> t[ which(t$count >= 1), ]
[1] 2 8
# Or if you still want a dataframe result
> t[ which(t$count >= 1), , drop=FALSE]
  count
2     2
6     8
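
To make the difference concrete, here is a minimal sketch comparing the raw logical index with its which() form. It rebuilds the question's data under the assumed name df (chosen only to avoid masking base::t); the other variable names are illustrative as well.

```r
# Same one-column data frame as in the question, under the assumed name df.
df <- data.frame(count = c(NA, 2, NA, NA, NA, 8, NA, NA, NA))

cond <- df$count >= 1   # logical vector: NA wherever count is NA
idx  <- which(cond)     # integer positions of the TRUE elements only

print(cond)
print(idx)

by_logical <- df[cond, , drop = FALSE]  # every NA in cond yields an all-NA row
by_which   <- df[idx,  , drop = FALSE]  # only the rows where the test is TRUE

nrow(by_logical)  # 9
nrow(by_which)    # 2
```

The logical index keeps one row per NA because "[" cannot tell whether those rows pass the test, while which() silently drops them.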

I also use subset(), since it handles NAs in the same manner as which(logical). The one gotcha arises when which() is used with a "-" sign to retrieve the complement set. If no elements satisfy the logical condition, which() returns an empty vector, and negating an empty vector still leaves it empty, so the -which(logical) form also returns no elements instead of all of them. For that reason I just do not use the -which combination:

> t[ -which(t$count < 1), , drop=FALSE]
[1] count
<0 rows> (or 0-length row.names)
> t[ which(t$count < 1), , drop=FALSE]
[1] count
<0 rows> (or 0-length row.names)
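
A short sketch of why the two calls above return the same empty result, plus two possible ways to get a complement that does not collapse. The names df, hits, is_hit and the others are assumptions made for this illustration; the complement shown here keeps the NA rows, which may or may not be what you want.

```r
# Same data as in the question, under the assumed name df.
df <- data.frame(count = c(NA, 2, NA, NA, NA, 8, NA, NA, NA))

hits <- which(df$count < 1)   # no row passes the test, so this is integer(0)
length(hits)                  # 0
identical(-hits, hits)        # TRUE: negating an empty index leaves it empty

# An empty negative index selects nothing rather than everything.
empty <- df[-hits, , drop = FALSE]
nrow(empty)                   # 0

# Option 1: negate a logical that treats NA as "not a hit".
is_hit <- (df$count < 1) %in% TRUE
complement <- df[!is_hit, , drop = FALSE]
nrow(complement)              # 9

# Option 2: remove the hit positions from the full set of row numbers.
complement2 <- df[setdiff(seq_len(nrow(df)), hits), , drop = FALSE]
nrow(complement2)             # 9

# subset() drops rows where the condition is NA, like which() does.
kept <- subset(df, count >= 1)
kept$count                    # 2 8
```
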
IRTFM answered Oct 02 '22 16:10


In data.table it works as you'd like it to w.r.t. NA, if I understand correctly. Also, you don't need to use $ and it doesn't mind if you forget the comma, either.

library(data.table)
dt = as.data.table(t)
dt[count>=1]   # NA's are treated as FALSE

The list of differences between data.table and data.frame is in data.table FAQ 2.17.

If you're thinking all these differences break compatibility, they don't. You can still pass a data.table to any package and when those packages use standard R syntax on the data.table, it still works.

Since you said large data.frame, data.table may be worth a look anyway.

These are the three relevant points from FAQ 2.17 (where DT means data.table and DF means data.frame):

  • DT[NA] returns 1 row of NA, but DF[NA] returns a copy of the whole of DF containing NA throughout. The symbol NA is type logical in R, and is therefore recycled by [.data.frame. Intention was probably DF[NA_integer_]. [.data.table does this automatically for convenience.

  • DT[c(TRUE,NA,FALSE)] treats the NA as FALSE, but DF[c(TRUE,NA,FALSE),] returns an NA row for each NA

  • DT[ColA==ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA==ColB,]
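
The data.frame side of these three points can be checked in base R alone. The sketch below uses a small made-up DF with columns ColA and ColB (the values are assumptions for illustration) and writes the row index with an explicit comma, DF[NA, ], since that is the row-indexing form in base R.

```r
# Small illustrative data frame; the values are made up for this sketch.
DF <- data.frame(ColA = c(1, NA, 3), ColB = c(1, 2, 4))

# A bare NA is logical, so it is recycled across all rows: every row comes back as NA.
all_na <- DF[NA, ]
nrow(all_na)    # 3

# NA_integer_ is a single integer index, so only one NA row is returned.
one_na <- DF[NA_integer_, ]
nrow(one_na)    # 1

# An NA inside a logical index yields an NA row instead of being dropped.
mixed <- DF[c(TRUE, NA, FALSE), ]
nrow(mixed)     # 2

# Comparing columns that contain NA needs explicit is.na() guards in base R.
eq <- DF[!is.na(DF$ColA) & !is.na(DF$ColB) & DF$ColA == DF$ColB, ]
nrow(eq)        # 1
```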

Matt Dowle answered Oct 02 '22 15:10