I use the following data.frame as an example:
d <- data.frame(x=c(1,NA), y=c(2,3))
I'd like to sum the values of y grouped by the variable x. Since no value of x repeats, I would expect aggregate to just give me the original data.frame back, with NA treated as its own group. Instead, aggregate gives me the following result.
>aggregate(y ~ x, data=d, FUN=sum)
x y
1 1 2
I've read the documentation about changing the default na.action, but doing so doesn't seem to change the result.
>aggregate(y ~ x, data=d, FUN=sum, na.action=na.pass)
x y
1 1 2
What is going on? I don't seem to understand what na.pass is doing in this case. Is there an option to accomplish what I want in R? Any help would be greatly appreciated.
aggregate makes use of tapply, which in turn makes use of factor on its grouping variable.
But look at what happens with NA values in factor:
factor(c(1, 2, NA))
# [1] 1 2 <NA>
# Levels: 1 2
Note the levels: NA is not among them. You can make use of addNA to keep the NA as a level:
addNA(factor(c(1, 2, NA)))
# [1] 1 2 <NA>
# Levels: 1 2 <NA>
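To see the effect on grouping directly, here is a minimal sketch that calls tapply on the question's data, once with the raw grouping vector and once with addNA applied. The names without_na and with_na are only illustrative, and this is meant to show the factor behaviour described above rather than the exact internals of aggregate.

```r
# Toy data from the question
d <- data.frame(x = c(1, NA), y = c(2, 3))

# Raw grouping vector: factor() leaves NA out of the levels,
# so the row with x = NA does not belong to any group
without_na <- tapply(d$y, d$x, sum)

# addNA() makes NA a real level, so that row forms its own group
with_na <- tapply(d$y, addNA(d$x), sum)

length(without_na)  # number of groups without addNA
length(with_na)     # number of groups with addNA
```

The second call keeps the row with the missing x because, once NA is a level, it is just another group.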
Thus, you would probably need to do something like:
aggregate(y ~ addNA(x), d, sum)
# addNA(x) y
# 1 1 2
# 2 <NA> 3
Or something like:
d$x <- addNA(factor(d$x))
str(d)
# 'data.frame': 2 obs. of 2 variables:
# $ x: Factor w/ 2 levels "1",NA: 1 2
# $ y: num 2 3
aggregate(y ~ x, d, sum)
# x y
# 1 1 2
# 2 <NA> 3
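The same idea should carry over to the non-formula interface of aggregate, where the grouping variables are passed as a list through by. The sketch below assumes the question's data and stores the result in res, a name chosen here only for illustration.

```r
# Toy data from the question
d <- data.frame(x = c(1, NA), y = c(2, 3))

# Non-formula interface: the value column(s) come first, and the
# grouping factors go in a named list; addNA() keeps NA as a group
res <- aggregate(d["y"], by = list(x = addNA(d$x)), FUN = sum)
res
```

Passing d["y"] rather than d$y keeps the summed column named y in the result.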
(Alternatively, make the upgrade to something like "data.table", which will not just be faster than aggregate, but which will also give you more consistent behavior with NA values. No need to pay heed to whether you're using the formula method of aggregate or not.)
library(data.table)
as.data.table(d)[, sum(y), by = x]
# x V1
# 1: 1 2
# 2: NA 3