I use ddply quite a bit, but I do not consider myself an expert. I have a data frame (df) with a grouping variable "Group", which takes the values "A", "B" and "C", and a numeric variable to summarize, "Var". If I use
ddply(df, .(Group), summarize, mysum=sum(Var))
then I get the sum for each of A, B and C, which is correct. But what I want is to sum over each consecutive run of Group values, in the order they appear in the data frame. For instance, if the data frame has
Group Var
A 1.3
A 1.2
A 0.4
B 0.3
B 1.3
C 1.5
C 1.7
C 1.9
A 2.1
A 2.4
B 6.7
then the desired result is
A 2.9
B 1.6
C 5.1
A 4.5
B 6.7
So, the desired output applies a function to each consecutive run of identical Group values, rather than to all rows that share a Group value. Can this be done in ddply?
Data
dat <- structure(list(Group = c("A", "A", "A", "B", "B", "C", "C", "C", "A", "A", "B"),
Var = c(1.3, 1.2, 0.4, 0.3, 1.3, 1.5, 1.7, 1.9, 2.1, 2.4, 6.7)),
.Names = c("Group", "Var"), class = "data.frame", row.names = c(NA, -11L))
Here's one way of doing this using the recently implemented rleid() function from data.table v1.9.6. See #686.
This generates the grouping ids as required:
require(data.table) ## v1.9.6+
DT = as.data.table(dat)
rleid(DT$Group)
# [1] 1 1 1 2 2 3 3 3 4 4 5
We can use this directly to aggregate as follows:
DT[, .(sum=sum(Var)), by=.(Group, rleid(Group))]
# Group rleid sum
# 1: A 1 2.9
# 2: B 2 1.6
# 3: C 3 5.1
# 4: A 4 4.5
# 5: B 5 6.7
HTH
Here is a base R equivalent:
dat <- structure(list(Group = c("A", "A", "A", "B", "B", "C", "C", "C", "A", "A", "B"),
Var = c(1.3, 1.2, 0.4, 0.3, 1.3, 1.5, 1.7, 1.9, 2.1, 2.4, 6.7)),
.Names = c("Group", "Var"), class = "data.frame", row.names = c(NA, -11L))
with(dat, cumsum(c(1L, Group[-length(Group)] != Group[-1])))
# [1] 1 1 1 2 2 3 3 3 4 4 5
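To see why this one-liner works, it may help to split it into steps. The sketch below is only an illustration; the intermediate names prev, nxt and changed are made up here and are not part of the original answer.

```r
Group <- c("A", "A", "A", "B", "B", "C", "C", "C", "A", "A", "B")

prev <- Group[-length(Group)]  # every element except the last
nxt  <- Group[-1]              # every element except the first

# TRUE wherever an element differs from the one before it (a new run starts)
changed <- prev != nxt
changed
# [1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE

# The first element always opens run 1; each TRUE adds 1 to the running id
ids <- cumsum(c(1L, changed))
ids
# [1] 1 1 1 2 2 3 3 3 4 4 5
```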
As a function
rleid <- function(x) cumsum(c(1L, x[-length(x)] != x[-1]))
(dat <- within(dat, id <- rleid(Group)))
# Group Var id
# 1 A 1.3 1
# 2 A 1.2 1
# 3 A 0.4 1
# 4 B 0.3 2
# 5 B 1.3 2
# 6 C 1.5 3
# 7 C 1.7 3
# 8 C 1.9 3
# 9 A 2.1 4
# 10 A 2.4 4
# 11 B 6.7 5
Then aggregate based on the new variable:
aggregate(Var ~ ., dat, sum)
# Group id Var
# 1 A 1 2.9
# 2 B 2 1.6
# 3 C 3 5.1
# 4 A 4 4.5
# 5 B 5 6.7
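Since the question asked about ddply specifically, the same run id can probably be used as the grouping variable there too. This is a minimal sketch, assuming the plyr package is installed; it groups by id only and carries Group along by taking its first value within each run.

```r
library(plyr)  # assumption: plyr is installed

dat <- data.frame(
  Group = c("A", "A", "A", "B", "B", "C", "C", "C", "A", "A", "B"),
  Var   = c(1.3, 1.2, 0.4, 0.3, 1.3, 1.5, 1.7, 1.9, 2.1, 2.4, 6.7)
)

# Same base R helper as above: a new id starts whenever the value changes
rleid <- function(x) cumsum(c(1L, x[-length(x)] != x[-1]))
dat$id <- rleid(dat$Group)

# One output row per run; Group is constant within a run, so take its first value
res <- ddply(dat, .(id), summarize, Group = Group[1], mysum = sum(Var))
res
```

This should give one row per run, with sums 2.9, 1.6, 5.1, 4.5 and 6.7, matching the desired result in the question.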
Alternatively, you can actually use rle, but it requires an atomic vector, so if you are using a factor then you need an extra step (i.e., as.vector):
rleid2 <- function(x) {
x <- as.vector(x)
rep(seq_along(rle(x)$values), rle(x)$lengths)
}
rleid2(dat$Group)
# [1] 1 1 1 2 2 3 3 3 4 4 5
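For reference, here is what rle itself returns on this data, along with a variant of the helper that stores the rle result instead of computing it twice. The name run_id is hypothetical, chosen here to avoid clashing with the functions above, and no claim is made about its speed relative to the benchmarks below.

```r
g <- factor(c("A", "A", "A", "B", "B", "C", "C", "C", "A", "A", "B"))

# rle() rejects a factor, so convert to a plain character vector first
r <- rle(as.vector(g))
r$values
# [1] "A" "B" "C" "A" "B"
r$lengths
# [1] 3 2 3 2 1

# Hypothetical variant: run rle() once, then repeat each run's index by its length
run_id <- function(x) {
  r <- rle(as.vector(x))
  rep(seq_along(r$lengths), r$lengths)
}

ids <- run_id(g)
ids
# [1] 1 1 1 2 2 3 3 3 4 4 5
```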
Some benchmarks:
set.seed(1)
dat2 <- dat[sample(1:nrow(dat), 1e6, TRUE), ]
identical(data.table::rleid(dat2$Group),
rleid(dat2$Group))
# [1] TRUE
library('microbenchmark')
microbenchmark(data.table::rleid(dat2$Group),
rleid(dat2$Group),
rleid2(dat2$Group), unit = 'relative')
# Unit: relative
# expr min lq mean median uq max neval cld
# data.table::rleid(dat2$Group) 1.032777 1.015395 1.005023 1.020923 1.000612 0.8935531 100 a
# rleid(dat2$Group) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100 a
# rleid2(dat2$Group) 35.747987 35.351585 28.600030 34.058992 33.147546 9.8786083 100 b