Following on from my questions:
1. Identifying whether a set of variables uniquely identifies each row of the data or not;
2. Tagging all rows that are duplicates in terms of a given set of variables,
I would now like to aggregate/consolidate all the duplicate rows in terms of a given set of variables, by taking their sum.
There is some guidance on how to do this here, but when the variables that form the index have a large number of levels, the ddply method recommended there is slow, just as it was when I was trying to tag all the duplicates by a given set of variables.
# Values of (f1, f2, f3, f4) uniquely identify observations
dfUnique = expand.grid(f1 = factor(1:16),
                       f2 = factor(1:41),
                       f3 = factor(1:2),
                       f4 = factor(1:104))
# sample some extra rows and rbind them
dfDup = rbind(dfUnique, dfUnique[sample(1:nrow(dfUnique), 100), ])
# dummy data
dfDup$data = rnorm(nrow(dfDup))
# aggregate the duplicate rows by taking the sum
library(plyr)
dfDupAgg = ddply(dfDup, .(f1, f2, f3, f4), summarise, data = sum(data))
The second solution is to use data.table, and following the advice here, I could do:
# data.table solution
library(data.table)
indexVars = paste0('f', 1:4)
dtDup = data.table(dfDup, key = indexVars)
dtDupAgg = dtDup[, list(data = sum(data)), by = key(dtDup)]
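As a quick sanity check (a sketch using only the objects created above), the aggregated result should have exactly one row per unique (f1, f2, f3, f4) combination, i.e. the same number of rows as dfUnique:
nrow(dtDupAgg) == nrow(dfUnique) ## should be TRUE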
I have a couple of questions:
1. Is there a way to make the ddply version faster?
2. Is the data.table version correct? I want to check, since I am new to data.table.
Regarding your data.table solution, you don't need to set a key for aggregation operations. You can directly do:
indexVars = paste0('f', 1:4)
dtDup <- as.data.table(dfDup) ## faster than data.table(.)
dtDupAgg = dtDup[, list(data = sum(data)), by = c(indexVars)]
data.table version 1.9.2+ also implements a function setDT that converts data.frames to data.tables by reference (meaning no copy is made, so the conversion takes almost no time; this is especially useful on large data.frames).
So, instead of doing:
dtDup <- as.data.table(dfDup)
dtDup[...]
You could do:
## data.table v1.9.2+
setDT(dfDup) ## faster than as.data.table(.)
dfDup[...] ## dfDup is now a data.table, converted by reference
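Putting those pieces together, a minimal sketch of the by-reference workflow on this data (assuming data.table 1.9.2+ and the indexVars vector defined earlier; dfDupAgg2 is just an illustrative name) would be:
library(data.table)  ## v1.9.2+ assumed
setDT(dfDup)         ## converts dfDup to a data.table by reference, no copy
dfDupAgg2 = dfDup[, list(data = sum(data)), by = c(indexVars)]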
On your first question, plyr is not known for its speed. Check Why is plyr so slow? (and the many informative comments there) for more info.
Perhaps you may be interested in dplyr, which is orders of magnitude faster than plyr, but still slower than data.table, IMHO. Here's the equivalent dplyr version:
dfDup %.% group_by(f1, f2, f3, f4) %.% summarise(data = sum(data))
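(Side note: %.% was dplyr's chaining operator at the time of writing; current dplyr releases use %>% instead. A rough modern equivalent, not part of the original benchmark below, would be:)
library(dplyr)
dfDup %>% group_by(f1, f2, f3, f4) %>% summarise(data = sum(data))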
Here's a benchmark between data.table and dplyr on this data (all timings are the minimum of three consecutive runs):
## data.table v1.9.2+
system.time(ans1 <- dtDup[, list(data=sum(data)), by=c(indexVars)])
# user system elapsed
# 0.049 0.009 0.057
## dplyr (commit ~1360 from github)
system.time(ans2 <- dfDup %.% group_by(f1, f2, f3, f4) %.% summarise(data = sum(data)))
# user system elapsed
# 0.374 0.013 0.389
I really don't have the patience to run the plyr version (I stopped it after 93 seconds of the first run). As you can see, dplyr is much faster than plyr, but still ~7x slower than data.table here.
Check that the results are equal to be sure (ans1 is reordered first, because data.table's by returns groups in order of first appearance, while dplyr sorts by the grouping variables):
all.equal(as.data.frame(ans1[order(f1, f2, f3, f4)]),
          as.data.frame(ans2))
# [1] TRUE
HTH