
Flatten/denormalize the result of the R aggregate function

I'm fairly new to R and I'm trying to use aggregate to perform some time series shaping on a dataframe, per subject and for each metric in my dataset. This works beautifully, but I find that the result isn't in a format that's very easy to use. I'd like to be able to transform the results back into the same format as the original dataframe.

Using the iris dataset as an example:

# Split into two data frames, one for metrics, the other for grouping
iris_species = subset(iris, select=Species)
iris_metrics = subset(iris, select=-Species)
# Compute diff for each metric with respect to its species
iris_diff = aggregate(iris_metrics, iris_species, diff)

I'm just using diff to illustrate that I have a function that shapes the time series, so I get a time series of possibly different length as a result and definitely not a single aggregate value (e.g. mean).

I'd like to transform the result, which seems to be a matrix with list-valued cells, back into the original "flat" dataframe format.

I'm mostly curious about how to manage this with results from aggregate, but I'd be ok with solutions that do everything in plyr or reshape.

Vince Gatto asked Mar 01 '13 22:03




2 Answers

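As you might know, aggregate works on one column at a time and expects the function to return a single value per group. If the function returns longer vectors, each aggregated column comes back as a matrix (or as a list, when the lengths differ between groups) instead of a plain vector, which is why the result is awkward to work with.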
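As a rough illustration of that behaviour on the question's data, the sketch below inspects one aggregated column and then unrolls the matrices into a flat data frame. It assumes every group yields the same number of values, which holds for the balanced iris data; `n_per_group` and `flat` are names introduced here for illustration only.

```r
# Recreate the result from the question
iris_diff <- aggregate(iris[, 1:4], iris["Species"], diff)

# Each metric column is a matrix: one row per species, one column per difference
dim(iris_diff$Sepal.Length)

# Assumption: all groups produced the same number of values
n_per_group <- ncol(iris_diff$Sepal.Length)

# Unroll each matrix row by row and repeat the species label to match
flat <- data.frame(
  Species = rep(iris_diff$Species, each = n_per_group),
  lapply(iris_diff[-1], function(m) as.vector(t(m)))
)
```

If the groups produced vectors of different lengths, aggregate would return list columns instead and this unrolling would not apply as written.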

You can split this up with by to get the data (with fewer rows than in iris) and put it back together:

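# Apply diff column-wise to each species' block of metrics
b <- by(iris_metrics, iris_species, FUN=function(x) diff(as.matrix(x)))
# Re-attach the species label to each block and stack the blocks
do.call(rbind, lapply(names(b), function(x) data.frame(Species=x, b[[x]])))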

diff(as.matrix(x)) is used because diff applied to a matrix differences each column separately, which is what you want here; it does not do this for data frames. The key point is that the function returns fewer rows per species than iris has (49 instead of 50).
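To make the column-wise behaviour concrete, here is a small sketch on the first three rows of iris. The names `m` and `d` are introduced here for illustration.

```r
# Take a 3 x 4 numeric block and convert it to a matrix
m <- as.matrix(iris[1:3, 1:4])

# diff on a matrix subtracts consecutive rows within each column,
# so the result has one row fewer and the same columns
d <- diff(m)
dim(d)
colnames(d)
```

The same one-row shrinkage per group is what makes the by result shorter than the original data.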

Matthew Lundberg answered Oct 07 '22 05:10



The best solution I could think of in this case is data.table:

require(data.table)
dt <- data.table(iris, key="Species")
# .SD holds the non-grouping columns of each Species group
dt.out <- dt[, lapply(.SD, diff), by=Species]

And if you want a plyr solution, then the idea is basically the same. Split by Species and apply diff to each column.

require(plyr)
ddply(iris, .(Species), function(x) do.call(cbind, lapply(x[,1:4], diff)))
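
For comparison, the same split-and-apply idea can be sketched in base R without extra packages. This is only one possible arrangement; `pieces` and `out` are names introduced here, and the metric columns are assumed to be the first four columns of iris.

```r
# Split the metric columns by species, then diff every column within each piece
pieces <- lapply(
  split(iris[, 1:4], iris$Species),
  function(g) as.data.frame(lapply(g, diff))
)

# Attach the species label to each piece and stack the pieces
out <- do.call(
  rbind,
  Map(function(sp, d) data.frame(Species = sp, d), names(pieces), pieces)
)
rownames(out) <- NULL
```

Because each piece is handled separately, this arrangement also copes with groups of different sizes.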
Arun answered Oct 07 '22 04:10
