I have a data.frame like this:
(t=structure(list(count = c(NA, 2, NA, NA, NA, 8, NA, NA, NA)), .Names = "count", row.names = c(NA,-9L), class = "data.frame"))
count
1 NA
2 2
3 NA
4 NA
5 NA
6 8
7 NA
8 NA
9 NA
It is great that R has the NA value, but sometimes it bites me. I often forget about it and try to subset like this:
> t[t$count>=1,]
[1] NA 2 NA NA NA 8 NA NA NA
The output includes all the NA rows, which I don't want.
After an hour of bug hunting I change the code to the following, which gives what I want (imagine a large data frame with lots of non-NA results and only a few "well-hidden" NAs):
> t[t$count>=1&!is.na(t$count),]
[1] 2 8
1. Is there a feature of the "as.integer" function so that I could do something like this:
t[as.integer.EXCLUDE.NA(t$count)>=1,]
I would like to use such a feature in other as.xxxx functions as well. Basically, I want to force R to stop thinking like a statistician and treat NA differently, e.g., like NULL (I am not sure NULL would solve my issue; t$count[3]<-NULL did not work for some reason).
2. Or how would I run something like
transform(t, replace all NAs from count columns with 0)
or even better
transform(t, replace all NA from all numeric columns with 0 in t)
3. Any general comments on making R forget about NAs are welcome.
I do not like the choices that were made when designing how "[" handles NA values either. The approach I take when I want to extract values using a logical test is to wrap the logical expression in which(). This converts the result to a set of numbers and indexing succeeds without dragging along the unwanted NA's:
> t[ which(t$count >= 1), ]
[1] 2 8
# Or if you still want a dataframe result
> t[ which(t$count >= 1), , drop=FALSE]
count
2 2
6 8
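To make the mechanism visible, here is a small sketch that looks at the intermediate values, using the same example data as the question. The names keep and idx are only illustrative.

```r
# Same example data as in the question
t <- data.frame(count = c(NA, 2, NA, NA, NA, 8, NA, NA, NA))

# The comparison yields NA wherever count is NA
keep <- t$count >= 1

# which() returns only the positions that are TRUE, so NAs are dropped
idx <- which(keep)

t$count[keep]  # logical index: every NA in keep produces an NA element
t$count[idx]   # integer index: only the matching values
```

Indexing with idx never sees an NA, which is why the NA rows disappear.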
I also use subset(), since it handles NA's in the same manner as which(logical). The one gotcha is when which() is used with a "-" sign to retrieve the complement set. If no elements satisfy the logical condition, which() returns an empty vector, so the -which(logical) form also selects nothing instead of everything. So I just do not use the -which() combo:
> t[ -which(t$count < 1), , drop=FALSE]
[1] count
<0 rows> (or 0-length row.names)
> t[ which(t$count < 1), , drop=FALSE]
[1] count
<0 rows> (or 0-length row.names)
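As a rough sketch of the alternatives, the following shows subset() on the same data, and one possible way to get the complement without -which(). The names drop_idx and bad are hypothetical, and treating NA as "does not match" is an assumption made for this example.

```r
# Same example data as in the question
t <- data.frame(count = c(NA, 2, NA, NA, NA, 8, NA, NA, NA))

# subset() drops rows where the condition is NA, and keeps a data frame
kept <- subset(t, count >= 1)

# The gotcha: nothing satisfies count < 1, so which() is empty
drop_idx <- which(t$count < 1)
nrow(t[-drop_idx, , drop = FALSE])   # 0 rows, not the 9 one might expect

# One alternative: build a logical mask with NA treated as FALSE,
# then negate it (note that this keeps the NA rows)
bad <- !is.na(t$count) & t$count < 1
complement <- t[!bad, , drop = FALSE]

# Another: set difference on the row positions
complement2 <- t[setdiff(seq_len(nrow(t)), drop_idx), , drop = FALSE]
```

Both alternatives return all nine rows here, because no row satisfies count < 1.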
In data.table it works as you'd like it to w.r.t. NA, if I understand correctly. Also, you don't need to use $ and it doesn't mind if you forget the comma, either.
library(data.table)
dt = as.data.table(t)
dt[count>=1] # NA's are treated as FALSE
The list of differences between data.table and data.frame is in FAQ 2.17.
If you're thinking all these differences break compatibility, they don't. You can still pass a data.table to any package and when those packages use standard R syntax on the data.table, it still works.
Since you said large data.frame, data.table may be worth a look anyway.
These are the 3 points from FAQ 2.17 (where DT means data.table and DF means data.frame):
DT[NA] returns 1 row of NA, but DF[NA] returns a copy of the whole of DF containing NA throughout. The symbol NA is type logical in R, and is therefore recycled by [.data.frame. Intention was probably DF[NA_integer_]. [.data.table does this automatically for convenience.
DT[c(TRUE,NA,FALSE)] treats the NA as FALSE, but DF[c(TRUE,NA,FALSE),] returns an NA row for each NA.
DT[ColA==ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA==ColB,]
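The data.frame side of these three points can be checked in base R with a small sketch. The data frame DF and its columns ColA and ColB are made up for illustration, and row indexing is written with an explicit comma (DF[NA, ]), which is an assumption about what the FAQ's DF[NA] is meant to show.

```r
# Hypothetical data with NAs in both columns
DF <- data.frame(ColA = c(1, NA, 3), ColB = c(1, 2, NA))

# Point 1: logical NA is recycled over all rows; integer NA selects one row
all_na <- DF[NA, ]
one_na <- DF[NA_integer_, ]

# Point 2: each NA in a logical row index yields an all-NA row
mixed <- DF[c(TRUE, NA, FALSE), ]

# Point 3: the comparison is c(TRUE, NA, NA), so NA rows leak in
# unless they are masked out explicitly (or which() is used)
leaky   <- DF[DF$ColA == DF$ColB, ]
guarded <- DF[!is.na(DF$ColA) & !is.na(DF$ColB) & DF$ColA == DF$ColB, ]
via_which <- DF[which(DF$ColA == DF$ColB), ]
```

In this sketch only the first row has equal, non-missing values, so guarded and via_which each contain that single row.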