Recently, I ran into behaviour in the table function that I did not expect.
For example, take the following vector:
ex_vec <- c("Non", "Non", "Nan", "Oui", "NaN", NA)
If I check for NA values in my vector, "NaN" is not considered one (as expected):
is.na(ex_vec)
# [1] FALSE FALSE FALSE FALSE FALSE TRUE
But if I try to get the frequencies of the different values:
table(ex_vec)
#ex_vec
#Nan Non Oui
#  1   2   1
"NaN" does not appear in the table.
However, if I "ask" table to show the NA values, I get this:
table(ex_vec, useNA="ifany")
#ex_vec
# Nan  NaN  Non  Oui <NA>
#   1    1    2    1    1
So the character string "NaN" is treated as an NA value inside the table call, yet it is shown in the output as a non-NA value.
I know I could (and it would be better to) solve my problem by converting my vector to a factor, but nonetheless I'd really like to know what's going on here. Does anyone have an idea?
NaN, short for "Not a Number", is a special floating-point value produced when an operation has an undefined or unrepresentable result, for example the square root of a negative number.
In R, missing values are represented by the symbol NA (not available), while undefined numeric results (e.g., 0/0) are represented by the symbol NaN (not a number). Note that dividing a non-zero number by zero gives Inf or -Inf, not NaN.
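As a small illustration of the distinction above, here is a sketch in base R (the variable names nan_val and inf_val are just for this example). It also shows the detail that matters for the rest of this thread: coercing the numeric NaN to character yields the ordinary string "NaN".

```r
# NaN arises from undefined floating-point operations; NA marks a missing value.
nan_val <- 0 / 0        # undefined result, so NaN
inf_val <- 1 / 0        # division of a non-zero number by zero gives Inf

is.nan(nan_val)         # TRUE
is.na(nan_val)          # TRUE: is.na() also reports NaN as missing
is.infinite(inf_val)    # TRUE
is.nan(NA)              # FALSE: NA is not NaN
is.na("NaN")            # FALSE: the string "NaN" is an ordinary character value
as.character(NaN)       # "NaN": coercion turns numeric NaN into that same string
```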
When factor matches up levels for a vector, it converts its exclude list to the same type as the input vector:
exclude <- as.vector(exclude, typeof(x))
so if your exclude list has NaN and your vector is character, this happens:
as.vector(exclude, typeof(letters))
[1] NA "NaN"
Oh dear. Now the real "NaN" strings will be excluded.
To fix, use exclude=NA in table (and factor if you are making factors that might hit this).
I do love this in the docs for factor:
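A minimal sketch of that fix, reusing the vector from the question (the name tab is only for illustration). With exclude = NA there is no NaN in the exclude list to be turned into the string "NaN", so the real "NaN" entry should be counted while the genuine NA is still dropped:

```r
ex_vec <- c("Non", "Non", "Nan", "Oui", "NaN", NA)

# Only NA is excluded; the character string "NaN" is kept as a regular level.
tab <- table(ex_vec, exclude = NA)

"NaN" %in% names(tab)   # TRUE
tab[["NaN"]]            # 1
sum(tab)                # 5: every element except the genuine NA
```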
There are some anomalies associated with factors that have ‘NA’ as
a level. It is suggested to use them sparingly, e.g., only for
tabulation purposes.
Reassuring...
My first idea was to look at the definition of table, which starts with:
> table
function (..., exclude = if (useNA == "no") c(NA, NaN), useNA = c("no",
"ifany", "always"), dnn = list.names(...), deparse.level = 1)
{
That sounds logical: by default, table excludes NA and NaN.
Digging into the table code, we see that if an argument is not a factor, it is coerced to one (nothing new here, the documentation says so).
else {
a <- factor(a, exclude = exclude)
I didn't find anything else there that could have turned "NaN" in the input into NA values.
So, looking into factor to find out why, we reach the root cause:
> factor
function (x = character(), levels, labels = levels, exclude = NA,
ordered = is.ordered(x), nmax = NA)
{
[...] # Snipped for brevity
exclude <- as.vector(exclude, typeof(x))
x <- as.character(x)
# levels is defined in the snipped part above: the sorted unique values of the input vector, coerced to character
levels <- levels[is.na(match(levels, exclude))]
f <- match(x, levels)
[...]
f
}
There we have it: the exclude argument, even though it holds NA and NaN values, is coerced to a character vector.
So what happens is:
> ex_vec <- c("Non", "Non", "Nan", "Oui", "NaN", NA)
> excludes<-c(NA,NaN)
> as.vector(excludes,"character")
[1] NA "NaN"
> match(ex_vec,as.vector(excludes,"character"))
[1] NA NA NA NA 2 1
The character string "NaN" is matched because the exclude vector has been coerced to character before the comparison.
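To tie the pieces together, here is a hedged sketch that replays only the level-filtering steps quoted from factor above, outside of factor itself. The names lv, kept_default and kept_fixed are made up for this illustration, and the real function does more than this.

```r
ex_vec <- c("Non", "Non", "Nan", "Oui", "NaN", NA)

# Candidate levels: sorted unique values (sort() drops NA by default).
lv <- sort(unique(ex_vec))

# table's default exclude list, coerced to the type of the input as factor does.
exclude_default <- as.vector(c(NA, NaN), typeof(ex_vec))
kept_default <- lv[is.na(match(lv, exclude_default))]

# Same steps with exclude = NA: there is no "NaN" string in the exclude list.
exclude_fixed <- as.vector(NA, typeof(ex_vec))
kept_fixed <- lv[is.na(match(lv, exclude_fixed))]

"NaN" %in% kept_default   # FALSE: the string was filtered out as if it were NaN
"NaN" %in% kept_fixed     # TRUE
```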