
reshape2: multiple results of aggregation function?

From what I read, *cast operations in reshape2 lost their result_variable feature. Hadley hints at using plyr for this purpose (appending multiple result columns to the input data frame). How would I realize the documentation example ...

aqm <- melt(airquality, id=c("month", "day"), na.rm=TRUE)
cast(aqm, month ~ variable + result_variable, range)

using reshape2 (dcast) and plyr (ddply)?

mrcalvin asked Jan 31 '14 09:01



2 Answers

This question has multiple answers, due to the flexibility of the 'reshape2' and 'plyr' packages. I will show one of the easiest examples to understand here:

library(reshape2)
library(plyr)

aqm <- melt(airquality, id=c("Month", "Day"), na.rm=TRUE)
aqm_ply <- ddply(aqm, .(Month, variable), summarize, min=min(value), max=max(value))
aqm_melt <- melt(aqm_ply, id=c("Month", "variable"), variable.name="variable2")
dcast(aqm_melt, Month ~ variable + variable2)

#   Month Ozone_min Ozone_max Solar.R_min Solar.R_max Wind_min Wind_max Temp_min  Temp_max
# 1     5         1       115           8         334      5.7     20.1       56        81
# 2     6        12        71          31         332      1.7     20.7       65        93
# 3     7         7       135           7         314      4.1     14.9       73        92
# 4     8         9       168          24         273      2.3     15.5       72        97
# 5     9         7        96          14         259      2.8     16.6       63        93

Step 1: Let's break it down into steps. First, we keep the definition of 'aqm' as it is and work from the melted data, which makes the example easier to follow.

aqm <- melt(airquality, id=c("Month", "Day"), na.rm=TRUE)

#     Month Day variable value
# 1       5   1    Ozone  41.0
# 2       5   2    Ozone  36.0
# 3       5   3    Ozone  12.0
# 4       5   4    Ozone  18.0
# ...
# 612     9  30     Temp  68.0

Step 2: Now, we want to replace the 'value' column with 'min' and 'max' columns. We can do this with the 'ddply' function from the 'plyr' package (data frame as input, data frame as output, hence "dd"-ply). We first specify the data.

ddply(aqm,

And then we specify the variables we want to use to group our data, 'Month' and 'variable'. We use the . function to refer to these variables directly, instead of referring to the values they contain.

ddply(aqm, .(Month, variable),

Now we need to choose an aggregating function. We choose the summarize function here, because we have columns ('Day' and 'value') that we don't want to include in our final data. The summarize function will strip away all of the original, non-grouping columns.

ddply(aqm, .(Month, variable), summarize,

Finally, we specify the calculation to do for each group. We can refer to the columns of the original data frame ('aqm'), even though they will not be contained in our final data frame. This is how it looks:

aqm_ply <- ddply(aqm, .(Month, variable), summarize, min=min(value), max=max(value))

#    Month variable  min   max
# 1      5    Ozone  1.0 115.0
# 2      5  Solar.R  8.0 334.0
# 3      5     Wind  5.7  20.1
# 4      5     Temp 56.0  81.0
# 5      6    Ozone 12.0  71.0
# 6      6  Solar.R 31.0 332.0
# 7      6     Wind  1.7  20.7
# 8      6     Temp 65.0  93.0
# 9      7    Ozone  7.0 135.0
# 10     7  Solar.R  7.0 314.0
# 11     7     Wind  4.1  14.9
# 12     7     Temp 73.0  92.0
# 13     8    Ozone  9.0 168.0
# 14     8  Solar.R 24.0 273.0
# 15     8     Wind  2.3  15.5
# 16     8     Temp 72.0  97.0
# 17     9    Ozone  7.0  96.0
# 18     9  Solar.R 14.0 259.0
# 19     9     Wind  2.8  16.6
# 20     9     Temp 63.0  93.0
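
As a side note on the grouping argument: as far as I know, 'ddply' also accepts a character vector of column names in place of the . function, which can be handy when the grouping columns are stored in a variable. The sketch below assumes the stock 'airquality' data set; the names 'by_quoted' and 'by_names' are made up for illustration.

```r
library(reshape2)
library(plyr)

aqm <- melt(airquality, id.vars = c("Month", "Day"), na.rm = TRUE)

# Grouping columns given with the . function, as in the answer
by_quoted <- ddply(aqm, .(Month, variable), summarize,
                   min = min(value), max = max(value))

# The same grouping columns given as a character vector
by_names <- ddply(aqm, c("Month", "variable"), summarize,
                  min = min(value), max = max(value))

# Both calls should describe the same 20 groups (5 months x 4 variables)
all.equal(by_quoted, by_names)
```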

Step 3: We can see that the data is vastly reduced, since the ddply function has aggregated the rows of each group into one. Now we need to melt the data again, so we can get our second variable for the final data frame. Note that we need to specify a new variable.name argument, so we don't have two columns named "variable".

aqm_melt <- melt(aqm_ply, id=c("Month", "variable"), variable.name="variable2")

#    Month variable variable2 value
# 1      5    Ozone       min   1.0
# 2      5  Solar.R       min   8.0
# 3      5     Wind       min   5.7
# 4      5     Temp       min  56.0
# 5      6    Ozone       min  12.0
# ...
# 37     9    Ozone       max  96.0
# 38     9  Solar.R       max 259.0
# 39     9     Wind       max  16.6
# 40     9     Temp       max  93.0
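
The second melt can also be written with every argument spelled out, which may make it clearer what ends up where. This is only a sketch of the same step: 'id=' in the call above is matched to melt's 'id.vars' argument, and 'measure.vars' and 'value.name' are set here to what melt would otherwise pick by default.

```r
library(reshape2)
library(plyr)

aqm <- melt(airquality, id.vars = c("Month", "Day"), na.rm = TRUE)
aqm_ply <- ddply(aqm, .(Month, variable), summarize,
                 min = min(value), max = max(value))

# id.vars stay as columns; the measure.vars ('min', 'max') are stacked
# into one 'value' column and labelled in the new 'variable2' column.
aqm_melt <- melt(aqm_ply,
                 id.vars = c("Month", "variable"),
                 measure.vars = c("min", "max"),
                 variable.name = "variable2",
                 value.name = "value")

levels(aqm_melt$variable2)
```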

Step 4: And we can finally wrap it all up by casting our data into the final form.

dcast(aqm_melt, Month ~ variable + variable2)

#   Month Ozone_min Ozone_max Solar.R_min Solar.R_max Wind_min Wind_max Temp_min  Temp_max
# 1     5         1       115           8         334      5.7     20.1       56        81
# 2     6        12        71          31         332      1.7     20.7       65        93
# 3     7         7       135           7         314      4.1     14.9       73        92
# 4     8         9       168          24         273      2.3     15.5       72        97
# 5     9         7        96          14         259      2.8     16.6       63        93
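
If you want to convince yourself that the cast result is right, one option is to recompute a pair of columns directly in base R. The sketch below does this for the Ozone columns only; 'ozone_min', 'ozone_max' and 'check' are illustrative names, not part of the answer's code.

```r
# Cross-check of the Ozone columns using base R only (no reshape2 or plyr).
# tapply() applies a function to 'Ozone' within each 'Month' group.
ozone_min <- tapply(airquality$Ozone, airquality$Month, min, na.rm = TRUE)
ozone_max <- tapply(airquality$Ozone, airquality$Month, max, na.rm = TRUE)

check <- data.frame(
  Month     = as.integer(names(ozone_min)),
  Ozone_min = as.vector(ozone_min),
  Ozone_max = as.vector(ozone_max)
)
check
```

The 'Ozone_min' and 'Ozone_max' columns of 'check' should match the corresponding columns of the cast data frame above.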

Hopefully, this example will give you enough understanding to get you started. Be aware that a new, data frame-optimized version of the 'plyr' package is being actively developed under the name 'dplyr', so you may want to be ready to convert your code to the new package after it becomes more fully fledged.

Dinre answered Sep 28 '22 03:09


I think that the other answers should have you covered in terms of how to use "plyr" or "dplyr" (and I would encourage you to continue looking in that direction).

For fun, here's a wrapper around dcast to let you specify multiple functions. It doesn't work with functions that return multiple values (like range) and it requires you to use a named list of functions.

dcastMult <- function(data, formula, value.var = "value", 
                   funs = list("min" = min, "max" = max)) {
  require(reshape2)
  if (is.null(names(funs)) || any(names(funs) == "")) stop("funs must be named")
  Form <- formula(formula)
  # Names of the left-hand-side (id) variables of the casting formula;
  # all.vars() also handles three or more id variables
  LHS <- all.vars(Form[[2]])
  temp <- lapply(seq_along(funs), function(Z) {
    T1 <- dcast(data, Form, value.var = value.var, 
                fun.aggregate=match.fun(funs[[Z]]), fill = 0)
    Names <- !names(T1) %in% LHS
    names(T1)[Names] <- paste(names(T1)[Names], names(funs)[[Z]], sep = "_")
    T1
  })
  Reduce(function(x, y) merge(x, y), temp)
}

It looks like a bit of a mess, but the result is that you get to stick with the same syntax you're familiar with, while getting to use multiple aggregation functions. The "names" for the funs argument are used as the suffixes in the resulting names. Anonymous functions can be specified as expected, for example maxSq = function(x) max(x)^2.

dcastMult(aqm, month ~ variable, value.var="value",
       funs = list("min" = min, "max" = max))
#   month ozone_min solar.r_min wind_min temp_min ozone_max solar.r_max wind_max temp_max
# 1     5         1           8      5.7       56       115         334     20.1       81
# 2     6        12          31      1.7       65        71         332     20.7       93
# 3     7         7           7      4.1       73       135         314     14.9       92
# 4     8         9          24      2.3       72       168         273     15.5       97
# 5     9         7          14      2.8       63        96         259     16.6       93
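
The call above uses lowercased names ('month', 'ozone_min', ...), so it presumably ran on a copy of 'airquality' whose column names had been lowercased, as in the question. For the stock data set, the idea behind the wrapper can be sketched without the helper: cast once per aggregation function, suffix the value columns, and merge on the id column. The names 'mins', 'maxs' and 'wide' are illustrative.

```r
library(reshape2)

# Stock 'airquality' has capitalized column names ('Month', 'Day')
aqm <- melt(airquality, id.vars = c("Month", "Day"), na.rm = TRUE)

# One cast per aggregation function; fill = 0 mirrors the wrapper above
mins <- dcast(aqm, Month ~ variable, value.var = "value",
              fun.aggregate = min, fill = 0)
maxs <- dcast(aqm, Month ~ variable, value.var = "value",
              fun.aggregate = max, fill = 0)

# Suffix every column except the id column 'Month'
names(mins)[-1] <- paste(names(mins)[-1], "min", sep = "_")
names(maxs)[-1] <- paste(names(maxs)[-1], "max", sep = "_")

# Join the two wide tables on the shared id column
wide <- merge(mins, maxs, by = "Month")
names(wide)
```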

A5C1D2H2I1M1N2O1R2T1 answered Sep 28 '22 03:09