I'm having some trouble aggregating a data frame while keeping the groups in their original order (i.e., ordered by each group's first appearance in the data frame). I've managed to get it right, but I was hoping there is an easier way to go about it.
Here is a sample data set to work on:
set.seed(7)
sel.1 <- sample(1:5, 20, replace = TRUE) # selection vector 1
sel.2 <- sample(1:5, 20, replace = TRUE)
add.1 <- sample(81:100) # additional vector 1
add.2 <- sample(81:100)
orig.df <- data.frame(sel.1, sel.2, add.1, add.2)
Some points to note: there are two selection columns that determine how the data is grouped. They will always be the same two columns, and their names are known. I have only put two additional columns in this sample data, but there may be more. I have given the columns names starting with 'sel' and 'add' to make the example easier to follow, but the actual data has different names (so while grep tricks are cool, they won't be useful here).
What I'm trying to do is aggregate the data frame into groups based on the 'sel' columns, and sum all the 'add' columns within each group. This is simple enough using aggregate, as follows:
# Get the names of all the additional columns
all.add <- names(orig.df)[!(names(orig.df) %in% c("sel.1", "sel.2"))]
aggr.df <- aggregate(orig.df[, all.add],
                     by = list(sel.1 = orig.df$sel.1, sel.2 = orig.df$sel.2),
                     sum)
The problem is that the result is ordered by the 'sel' columns; I want it ordered based on each group's first appearance in the original data.
Here are my best attempts at making this work:
## Attempt 1
# create indices for each row (x) and find the minimum index for each range
index.df <- aggregate(x = 1:nrow(orig.df),
                      by = list(sel.1 = orig.df$sel.1, sel.2 = orig.df$sel.2),
                      min)
# Reorder the aggregated rows by each group's first appearance in orig.df
index.order <- order(index.df$x)
aggr.df[index.order, ]
## Attempt 2
# get the unique groups. These are in the right order.
unique.sel <- unique(orig.df[,c("sel.1", "sel.2")])
# use sapply to effectively loop over data and sum additional columns.
sums <- t(sapply(1:nrow(unique.sel), function(x) {
  sapply(all.add, function(y) {
    sum(aggr.df[which(aggr.df$sel.1 == unique.sel$sel.1[x] &
                      aggr.df$sel.2 == unique.sel$sel.2[x]), y])
  })
}))
data.frame(unique.sel, sums)
While these give me the right result, I was hoping that somebody could point out a simpler solution. It would be preferable if the solution works with the packages that come with the standard R installation.
I've looked at the documentation for aggregate and match, but I couldn't find an answer (I guess I was hoping for something like a "keep.original.order" parameter for aggregate).
Any help would be much appreciated!
Update: (in case anybody stumbles across this)
Here is the cleanest way that I could find after trying for a few more days:
unique(data.frame(sapply(names(orig.df), function(x) {
  if (x %in% c("sel.1", "sel.2")) orig.df[, x]
  else ave(orig.df[, x], orig.df$sel.1, orig.df$sel.2, FUN = sum)
}, simplify = FALSE)))
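If the grouping column names are stored in a character vector (as they effectively are in my real data), the same idea can be written a little more generally. This is just a sketch; sel.cols is a made-up name, and the grouping columns are passed to ave() as a single list:
# Same approach, with the grouping column names held in a vector
# (sel.cols is a hypothetical name, not part of the original data)
sel.cols <- c("sel.1", "sel.2")
unique(data.frame(sapply(names(orig.df), function(x) {
  if (x %in% sel.cols) orig.df[, x]
  else ave(orig.df[, x], orig.df[sel.cols], FUN = sum)
}, simplify = FALSE)))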
It's short and simple in data.table. It returns the groups in first appearance order by default.
require(data.table)
DT = as.data.table(orig.df)
DT[, list(sum(add.1),sum(add.2)), by=list(sel.1,sel.2)]
sel.1 sel.2 V1 V2
1: 5 4 96 84
2: 2 2 175 176
3: 1 5 384 366
4: 2 5 95 89
5: 4 1 174 192
6: 2 4 82 87
7: 5 3 91 98
8: 3 2 189 178
9: 1 4 170 183
10: 1 1 100 91
11: 3 3 81 82
12: 5 5 83 88
13: 2 3 90 96
And this will be fast for large data, so no need to change your code later if you do find speed issues. The following alternative syntax is the easiest way to pass in which columns to group by.
DT[, lapply(.SD,sum), by=c("sel.1","sel.2")]
sel.1 sel.2 add.1 add.2
1: 5 4 96 84
2: 2 2 175 176
3: 1 5 384 366
4: 2 5 95 89
5: 4 1 174 192
6: 2 4 82 87
7: 5 3 91 98
8: 3 2 189 178
9: 1 4 170 183
10: 1 1 100 91
11: 3 3 81 82
12: 5 5 83 88
13: 2 3 90 96
Alternatively, by may also be a single comma-separated string of column names:
DT[, lapply(.SD,sum), by="sel.1,sel.2"]
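If the grouping columns and the columns to sum are held in character vectors (like all.add in the question), something along these lines should also work; .SDcols controls which columns appear in .SD, and sel.cols / add.cols are assumed names used only for illustration:
# Group and summarise using column names stored in character vectors
# (sel.cols and add.cols are hypothetical names, not from the original post)
sel.cols <- c("sel.1", "sel.2")
add.cols <- setdiff(names(DT), sel.cols)
DT[, lapply(.SD, sum), by = sel.cols, .SDcols = add.cols]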
A bit tough to read, but it gives you what you want and I added some comments to clarify.
# Define the columns you want to combine into the grouping variable
sel.col <- grepl("^sel", names(orig.df))
# Create the grouping variable
lev <- apply(orig.df[sel.col], 1, paste, collapse=" ")
# Split and sum up
data.frame(unique(orig.df[sel.col]),
           t(sapply(split(orig.df[!sel.col], factor(lev, levels = unique(lev))),
                    apply, 2, sum)))
The output looks like this:
sel.1 sel.2 add.1 add.2
1 5 4 96 84
2 2 2 175 176
3 1 5 384 366
5 2 5 95 89
6 4 1 174 192
7 2 4 82 87
8 5 3 91 98
10 3 2 189 178
11 1 4 170 183
14 1 1 100 91
17 3 3 81 82
19 5 5 83 88
20 2 3 90 96
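For what it's worth, base rowsum() can keep the same first-appearance ordering via its reorder = FALSE argument, which leaves groups in the order they are encountered. A sketch reusing sel.col and lev from above:
# Same result via rowsum(); reorder = FALSE keeps groups in encounter order
data.frame(unique(orig.df[sel.col]),
           rowsum(orig.df[!sel.col], group = lev, reorder = FALSE))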
Looking for solutions to the same problem, I found another approach using aggregate(): first convert the selection variables to factors whose levels are in the order you want.
all.add <- names(orig.df)[!(names(orig.df) %in% c("sel.1", "sel.2"))]
# Selection variables as factors with levels in the order you want
orig.df$sel.1 <- factor(orig.df$sel.1, levels = unique(orig.df$sel.1))
orig.df$sel.2 <- factor(orig.df$sel.2, levels = unique(orig.df$sel.2))
# This is ordered first by sel.1, then by sel.2
aggr.df.ordered <- aggregate(orig.df[, all.add],
                             by = list(sel.1 = orig.df$sel.1, sel.2 = orig.df$sel.2),
                             sum)
The output is:
sel.1 sel.2 add.1 add.2
1 1 1 100 91
2 1 4 170 183
3 1 5 384 366
4 2 2 175 176
5 2 3 90 96
6 2 4 82 87
7 2 5 95 89
8 3 2 189 178
9 3 3 81 82
10 4 1 174 192
11 5 3 91 98
12 5 4 96 84
13 5 5 83 88
To have it ordered by the first appearance of each combination of both variables, you need a new grouping variable:
# ordered by first appearance of the two variables (needs a new variable)
orig.df$newvar <- paste(orig.df$sel.1, orig.df$sel.2)
orig.df$newvar <- factor(orig.df$newvar, levels = unique(orig.df$newvar))
aggr.df.ordered2 <- aggregate(orig.df[, all.add],
                              by = list(newvar = orig.df$newvar,
                                        sel.1 = orig.df$sel.1,
                                        sel.2 = orig.df$sel.2),
                              sum)
which gives the output:
newvar sel.2 sel.1 add.1 add.2
1 5 4 4 5 96 84
2 5 5 5 5 83 88
3 5 3 3 5 91 98
4 2 4 4 2 82 87
5 2 2 2 2 175 176
6 2 5 5 2 95 89
7 2 3 3 2 90 96
8 1 4 4 1 170 183
9 1 5 5 1 384 366
10 1 1 1 1 100 91
11 4 1 1 4 174 192
12 3 2 2 3 189 178
13 3 3 3 3 81 82
With this solution you do not need to install any new package.
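If you'd rather not keep the helper column in the final result, a small cleanup step (a sketch reusing all.add from above) could be:
# Drop the helper grouping column and put the selection columns first
aggr.df.ordered2$newvar <- NULL
aggr.df.ordered2 <- aggr.df.ordered2[, c("sel.1", "sel.2", all.add)]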