Aggregate data frame while keeping original order, in a simple manner

Question

I'm having some trouble aggregating a data frame while keeping the groups in their original order (order based on first appearance in data frame). I've managed to get it right, but was hoping there is an easier way to go about it.

Here is a sample data set to work on:

set.seed(7)
sel.1 <- sample(1:5, 20, replace = TRUE)     # selection vector 1
sel.2 <- sample(1:5, 20, replace = TRUE)
add.1 <- sample(81:100)                      # additional vector 1
add.2 <- sample(81:100)
orig.df <- data.frame(sel.1, sel.2, add.1, add.2)

Some points to note: there are two selection columns to determine how the data is grouped together. They will be the same, and their names are known. I have only put two additional columns in this data, but there may be more. I have given the columns names starting with 'sel' and 'add' to make it easier to follow, but the actual data has different names (so while grep tricks are cool, they won't be useful here).

What I'm trying to do is aggregate the data frame into groups based on the 'sel' columns, and to sum together all the 'add' columns. This is simple enough using aggregate as follows:

# Get the names of all the additional columns
all.add <- names(orig.df)[!(names(orig.df)) %in% c("sel.1", "sel.2")]
aggr.df <- aggregate(orig.df[,all.add], 
                     by=list(sel.1 = orig.df$sel.1, sel.2 = orig.df$sel.2), sum)

The problem is that the result is ordered by the 'sel' columns; I want it ordered based on each group's first appearance in the original data.

Here are my best attempts at making this work:

## Attempt 1
# create indices for each row (x) and find the minimum index for each range
index.df <- aggregate(x = 1:nrow(orig.df),
                      by=list(sel.1 = orig.df$sel.1, sel.2 = orig.df$sel.2), min)
# Make sure the x vector (indices) are in the right range for aggr.df
index.order <- (1:nrow(index.df))[order(index.df$x)]
aggr.df[index.order,]

## Attempt 2
# get the unique groups. These are in the right order.
unique.sel <- unique(orig.df[,c("sel.1", "sel.2")])
# use sapply to effectively loop over data and sum additional columns.
sums <- t(sapply(1:nrow(unique.sel), function (x) {
    sapply(all.add, function (y) {
        sum(aggr.df[which(aggr.df$sel.1 == unique.sel$sel.1[x] & 
                          aggr.df$sel.2 == unique.sel$sel.2[x]), y])
        })
}))
data.frame(unique.sel, sums)

While these give me the right result, I was hoping that somebody could point out a simpler solution. It would be preferable if the solution works with the packages that come with the standard R installation.

I've looked at the the documentation for aggregate and match, but I couldn't find an answer (I guess I was hoping for something like a "keep.original.order" parameter for aggregate).

Any help would be much appreciated!

Update: (in case anybody stumbles across this)

Here is the cleanest way that I could find after trying for a few more days:

unique(data.frame(sapply(names(orig.df), function(x){
    if(x %in% c("sel.1", "sel.2")) orig.df[,x] else
    ave(orig.df[,x], orig.df$sel.1, orig.df$sel.2, FUN=sum)},
simplify=FALSE)))

Matt Dowle · Accepted Answer

It's short and simple in data.table. It returns the groups in first appearance order by default.

require(data.table)
DT = as.data.table(orig.df)
DT[, list(sum(add.1),sum(add.2)), by=list(sel.1,sel.2)]

    sel.1 sel.2  V1  V2
 1:     5     4  96  84
 2:     2     2 175 176
 3:     1     5 384 366
 4:     2     5  95  89
 5:     4     1 174 192
 6:     2     4  82  87
 7:     5     3  91  98
 8:     3     2 189 178
 9:     1     4 170 183
10:     1     1 100  91
11:     3     3  81  82
12:     5     5  83  88
13:     2     3  90  96

And this will be fast for large data, so no need to change your code later if you do find speed issues. The following alternative syntax is the easiest way to pass in which columns to group by.

DT[, lapply(.SD,sum), by=c("sel.1","sel.2")]

    sel.1 sel.2 add.1 add.2
 1:     5     4    96    84
 2:     2     2   175   176
 3:     1     5   384   366
 4:     2     5    95    89
 5:     4     1   174   192
 6:     2     4    82    87
 7:     5     3    91    98
 8:     3     2   189   178
 9:     1     4   170   183
10:     1     1   100    91
11:     3     3    81    82
12:     5     5    83    88
13:     2     3    90    96

or, by may also be a single comma separated string of column names, too :

DT[, lapply(.SD,sum), by="sel.1,sel.2"]

Backlin · Answer

A bit tough to read, but it gives you what you want and I added some comments to clarify.

# Define the columns you want to combine into the grouping variable
sel.col <- grepl("^sel", names(orig.df))
# Create the grouping variable
lev <- apply(orig.df[sel.col], 1, paste, collapse=" ")
# Split and sum up
data.frame(unique(orig.df[sel.col]),
           t(sapply(split(orig.df[!sel.col], factor(lev, levels=unique(lev))),
                    apply, 2, sum)))

The output looks like this

   sel.1 sel.2 add.1 add.2
1      5     4    96    84
2      2     2   175   176
3      1     5   384   366
5      2     5    95    89
6      4     1   174   192
7      2     4    82    87
8      5     3    91    98
10     3     2   189   178
11     1     4   170   183
14     1     1   100    91
17     3     3    81    82
19     5     5    83    88
20     2     3    90    96

Alfonso Gar · Answer

Looking for solutions to the same problem, I found a new one using aggregate(), but first converting the select variables as factors with the order you want.

all.add <- names(orig.df)[!(names(orig.df)) %in% c("sel.1", "sel.2")]

# Selection variables as factor with leves in the order you want
orig.df$sel.1 <- factor(orig.df$sel.1, levels = unique(orig.df$sel.1))
orig.df$sel.2 <- factor(orig.df$sel.2, levels = unique(orig.df$sel.2))

# This is ordered first by sel.1, then by sel.2
aggr.df.ordered <- aggregate(orig.df[,all.add], 
                             by=list(sel.1 = orig.df$sel.1, sel.2 = orig.df$sel.2), sum)

The output is:

   newvar add.1 add.2
1     1 1   100    91
2     1 4   170   183
3     1 5   384   366
4     2 2   175   176
5     2 3    90    96
6     2 4    82    87
7     2 5    95    89
8     3 2   189   178
9     3 3    81    82
10    4 1   174   192
11    5 3    91    98
12    5 4    96    84
13    5 5    83    88

To have it ordered for the first appearance of each combination of both variables, you need a new variable:

# ordered by first appearance of the two variables (needs a new variable)
orig.df$newvar <- paste(orig.df$sel.1, orig.df$sel.2)
orig.df$newvar <- factor(orig.df$newvar, levels = unique(orig.df$newvar))

aggr.df.ordered2 <- aggregate(orig.df[,all.add], 
                              by=list(newvar = orig.df$newvar,
                                      sel.1 = orig.df$sel.1, 
                                      sel.2 = orig.df$sel.2), sum)

which gives the output:

   newvar sel.2 sel.1 add.1 add.2
1     5 4     4     5    96    84
2     5 5     5     5    83    88
3     5 3     3     5    91    98
4     2 4     4     2    82    87
5     2 2     2     2   175   176
6     2 5     5     2    95    89
7     2 3     3     2    90    96
8     1 4     4     1   170   183
9     1 5     5     1   384   366
10    1 1     1     1   100    91
11    4 1     1     4   174   192
12    3 2     2     3   189   178
13    3 3     3     3    81    82

With this solution you do not need to install any new package.

Aggregate data frame while keeping original order, in a simple manner

Tags:

r

aggregate

data.table

Edward

3 Answers

Matt Dowle

Backlin

Alfonso Gar

Recent Activity

Donate For Us

Aggregate data frame while keeping original order, in a simple manner

Tags:

r

aggregate

data.table

Edward

3 Answers

Matt Dowle

Backlin

Alfonso Gar

Related questions

Recent Activity

Donate For Us