Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Aggregate data frame while keeping original order, in a simple manner

I'm having some trouble aggregating a data frame while keeping the groups in their original order (order based on first appearance in data frame). I've managed to get it right, but was hoping there is an easier way to go about it.

Here is a sample data set to work on:

set.seed(7)
sel.1 <- sample(1:5, 20, replace = TRUE)     # selection vector 1
sel.2 <- sample(1:5, 20, replace = TRUE)
add.1 <- sample(81:100)                      # additional vector 1
add.2 <- sample(81:100)
orig.df <- data.frame(sel.1, sel.2, add.1, add.2)

Some points to note: there are two selection columns to determine how the data is grouped together. They will be the same, and their names are known. I have only put two additional columns in this data, but there may be more. I have given the columns names starting with 'sel' and 'add' to make it easier to follow, but the actual data has different names (so while grep tricks are cool, they won't be useful here).

What I'm trying to do is aggregate the data frame into groups based on the 'sel' columns, and to sum together all the 'add' columns. This is simple enough using aggregate as follows:

# Get the names of all the additional columns
all.add <- names(orig.df)[!(names(orig.df)) %in% c("sel.1", "sel.2")]
aggr.df <- aggregate(orig.df[,all.add], 
                     by=list(sel.1 = orig.df$sel.1, sel.2 = orig.df$sel.2), sum)

The problem is that the result is ordered by the 'sel' columns; I want it ordered based on each group's first appearance in the original data.

Here are my best attempts at making this work:

## Attempt 1
# create indices for each row (x) and find the minimum index for each range
index.df <- aggregate(x = 1:nrow(orig.df),
                      by=list(sel.1 = orig.df$sel.1, sel.2 = orig.df$sel.2), min)
# Make sure the x vector (indices) are in the right range for aggr.df
index.order <- (1:nrow(index.df))[order(index.df$x)]
aggr.df[index.order,]

## Attempt 2
# get the unique groups. These are in the right order.
unique.sel <- unique(orig.df[,c("sel.1", "sel.2")])
# use sapply to effectively loop over data and sum additional columns.
sums <- t(sapply(1:nrow(unique.sel), function (x) {
    sapply(all.add, function (y) {
        sum(aggr.df[which(aggr.df$sel.1 == unique.sel$sel.1[x] & 
                          aggr.df$sel.2 == unique.sel$sel.2[x]), y])
        })
}))
data.frame(unique.sel, sums)

While these give me the right result, I was hoping that somebody could point out a simpler solution. It would be preferable if the solution works with the packages that come with the standard R installation.

I've looked at the the documentation for aggregate and match, but I couldn't find an answer (I guess I was hoping for something like a "keep.original.order" parameter for aggregate).

Any help would be much appreciated!


Update: (in case anybody stumbles across this)

Here is the cleanest way that I could find after trying for a few more days:

unique(data.frame(sapply(names(orig.df), function(x){
    if(x %in% c("sel.1", "sel.2")) orig.df[,x] else
    ave(orig.df[,x], orig.df$sel.1, orig.df$sel.2, FUN=sum)},
simplify=FALSE)))
like image 838
Edward Avatar asked Aug 08 '12 19:08

Edward


3 Answers

It's short and simple in data.table. It returns the groups in first appearance order by default.

require(data.table)
DT = as.data.table(orig.df)
DT[, list(sum(add.1),sum(add.2)), by=list(sel.1,sel.2)]

    sel.1 sel.2  V1  V2
 1:     5     4  96  84
 2:     2     2 175 176
 3:     1     5 384 366
 4:     2     5  95  89
 5:     4     1 174 192
 6:     2     4  82  87
 7:     5     3  91  98
 8:     3     2 189 178
 9:     1     4 170 183
10:     1     1 100  91
11:     3     3  81  82
12:     5     5  83  88
13:     2     3  90  96

And this will be fast for large data, so no need to change your code later if you do find speed issues. The following alternative syntax is the easiest way to pass in which columns to group by.

DT[, lapply(.SD,sum), by=c("sel.1","sel.2")]

    sel.1 sel.2 add.1 add.2
 1:     5     4    96    84
 2:     2     2   175   176
 3:     1     5   384   366
 4:     2     5    95    89
 5:     4     1   174   192
 6:     2     4    82    87
 7:     5     3    91    98
 8:     3     2   189   178
 9:     1     4   170   183
10:     1     1   100    91
11:     3     3    81    82
12:     5     5    83    88
13:     2     3    90    96

or, by may also be a single comma separated string of column names, too :

DT[, lapply(.SD,sum), by="sel.1,sel.2"]
like image 58
Matt Dowle Avatar answered Nov 05 '22 06:11

Matt Dowle


A bit tough to read, but it gives you what you want and I added some comments to clarify.

# Define the columns you want to combine into the grouping variable
sel.col <- grepl("^sel", names(orig.df))
# Create the grouping variable
lev <- apply(orig.df[sel.col], 1, paste, collapse=" ")
# Split and sum up
data.frame(unique(orig.df[sel.col]),
           t(sapply(split(orig.df[!sel.col], factor(lev, levels=unique(lev))),
                    apply, 2, sum)))

The output looks like this

   sel.1 sel.2 add.1 add.2
1      5     4    96    84
2      2     2   175   176
3      1     5   384   366
5      2     5    95    89
6      4     1   174   192
7      2     4    82    87
8      5     3    91    98
10     3     2   189   178
11     1     4   170   183
14     1     1   100    91
17     3     3    81    82
19     5     5    83    88
20     2     3    90    96
like image 20
Backlin Avatar answered Nov 05 '22 06:11

Backlin


Looking for solutions to the same problem, I found a new one using aggregate(), but first converting the select variables as factors with the order you want.

all.add <- names(orig.df)[!(names(orig.df)) %in% c("sel.1", "sel.2")]

# Selection variables as factor with leves in the order you want
orig.df$sel.1 <- factor(orig.df$sel.1, levels = unique(orig.df$sel.1))
orig.df$sel.2 <- factor(orig.df$sel.2, levels = unique(orig.df$sel.2))

# This is ordered first by sel.1, then by sel.2
aggr.df.ordered <- aggregate(orig.df[,all.add], 
                             by=list(sel.1 = orig.df$sel.1, sel.2 = orig.df$sel.2), sum)

The output is:

   newvar add.1 add.2
1     1 1   100    91
2     1 4   170   183
3     1 5   384   366
4     2 2   175   176
5     2 3    90    96
6     2 4    82    87
7     2 5    95    89
8     3 2   189   178
9     3 3    81    82
10    4 1   174   192
11    5 3    91    98
12    5 4    96    84
13    5 5    83    88

To have it ordered for the first appearance of each combination of both variables, you need a new variable:

# ordered by first appearance of the two variables (needs a new variable)
orig.df$newvar <- paste(orig.df$sel.1, orig.df$sel.2)
orig.df$newvar <- factor(orig.df$newvar, levels = unique(orig.df$newvar))

aggr.df.ordered2 <- aggregate(orig.df[,all.add], 
                              by=list(newvar = orig.df$newvar,
                                      sel.1 = orig.df$sel.1, 
                                      sel.2 = orig.df$sel.2), sum)

which gives the output:

   newvar sel.2 sel.1 add.1 add.2
1     5 4     4     5    96    84
2     5 5     5     5    83    88
3     5 3     3     5    91    98
4     2 4     4     2    82    87
5     2 2     2     2   175   176
6     2 5     5     2    95    89
7     2 3     3     2    90    96
8     1 4     4     1   170   183
9     1 5     5     1   384   366
10    1 1     1     1   100    91
11    4 1     1     4   174   192
12    3 2     2     3   189   178
13    3 3     3     3    81    82

With this solution you do not need to install any new package.

like image 21
Alfonso Gar Avatar answered Nov 05 '22 06:11

Alfonso Gar