Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Aggregate and keep rows with NA

Tags:

r

aggregate

na

What I want to do: arrange data by name and class, sort by year, calculate the mean of the values and keep the valueMessage.

Sample data:

name <- c("a", "a", "b", "b")
class <- c("c1", "c1", "c3", "c3")
year <- c("2010", "2010", "2008", "2008")
value <- c(100, 33, 100, 90)
valueMessage <-c(NA, "meh", NA, NA)

df <- data.frame(name, class, year, value, valueMessage)
df 


dat <- aggregate(df$value, list(year = df$year, name = df$name, class=df$class, valueMessage=df$valueMessage), mean)
dat <- dat[with(dat, order(class, name, year)), ]

Returns this:

year name class valueMessage  x
1 2010    a    c1          meh 33

But I would like it to keep the NA and return something like this:

year name class valueMessage  x
1 2010    a    c1          meh, NA 66.5
2 2008    b    c3          NA 95
like image 205
user3804488 Avatar asked Jul 09 '14 07:07

user3804488


2 Answers

The thing is that you're trying to do two different aggregations: one for your "value" column, and one for your "valueMessage" column.

If that's the case, you would have to do each separately and merge them, or look at using a package, like "data.table" to help out.

With "data.table", you can do the following:

library(data.table)
DT <- data.table(df)
DT[, list(value = mean(value), 
          valueMessage = list(unique(valueMessage))), 
   by = list(year, name, class)]
#    year name class value valueMessage
# 1: 2010    a    c1  66.5       NA,meh
# 2: 2008    b    c3  95.0           NA

Note that the "valueMessage" column is a list. You could use paste if you wanted it to be a character vector instead. In other words:

DT[, list(value = mean(value), 
          valueMessage = paste(unique(valueMessage), collapse = ",")), 
   by = list(year, name, class)]

Here, I've assumed you're only interested in unique "valueMessage" values.


If you wanted to go with base R's aggregate, you could try something like the following:

M1 <- aggregate(value ~ year + name + class, df, mean)
M2 <- aggregate(as.character(valueMessage) ~ year + name + class,
                df, unique, na.action = na.pass)
merge(M1, M2)

I've used the "formula" method because I like the output better. as.character(valueMessage) was required because that column is presently a factor. The output of the valueMessage column would be, again, a list, but you can use an anonymous function and paste if you prefer a character vector of length 1 instead.

like image 172
A5C1D2H2I1M1N2O1R2T1 Avatar answered Sep 18 '22 16:09

A5C1D2H2I1M1N2O1R2T1


I like to use sqldf, because SQL is such a nice, simple and intuitive way to do this (without need of knowing thousands of R functions and their specifics and gotchas):

require(sqldf)
sqldf('
select year, name, class, avg(value), 
       group_concat(distinct case when valueMessage is NULL 
                                  then "NA" 
                                  else valueMessage 
                             end) as valueMessages
from df
group by class, name, year
')

On your modified example it will produce this output:

  year name class avg(value) valueMessages
1 2010    a    c1       66.5        NA,meh
2 2008    b    c3       95.0            NA
like image 4
Tomas Avatar answered Sep 20 '22 16:09

Tomas