While experimenting with aggregate for another question here, I encountered a rather strange result. I can't figure out why, and I'm wondering whether what I'm doing is totally wrong.
Suppose I have a data.frame like this:
df <- structure(list(V1 = c(1L, 2L, 1L, 2L, 3L, 1L),
V2 = c(2L, 3L, 2L, 3L, 4L, 2L),
V3 = c(3L, 4L, 3L, 4L, 5L, 3L),
V4 = c(4L, 5L, 4L, 5L, 6L, 4L)),
.Names = c("V1", "V2", "V3", "V4"),
row.names = c(NA, -6L), class = "data.frame")
> df
# V1 V2 V3 V4
# 1 1 2 3 4
# 2 2 3 4 5
# 3 1 2 3 4
# 4 2 3 4 5
# 5 3 4 5 6
# 6 1 2 3 4
Now, I want to output a data.frame of the unique rows, with an additional column indicating their frequency in df. For this example, the desired output is:
# V1 V2 V3 V4 x
# 1 1 2 3 4 3
# 2 2 3 4 5 2
# 3 3 4 5 6 1
I obtained this output using aggregate by experimenting as follows:
> aggregate(do.call(paste, df), by=df, print)
# [1] "1 2 3 4" "1 2 3 4" "1 2 3 4"
# [1] "2 3 4 5" "2 3 4 5"
# [1] "3 4 5 6"
# V1 V2 V3 V4 x
# 1 1 2 3 4 1 2 3 4, 1 2 3 4, 1 2 3 4
# 2 2 3 4 5 2 3 4 5, 2 3 4 5
# 3 3 4 5 6 3 4 5 6
So, this gave me the pasted strings. If I use length instead of print, it should give me the number of such occurrences, which is the desired result, and that was indeed the case (as shown below).
> aggregate(do.call(paste, df), by=df, length)
# V1 V2 V3 V4 x
# 1 1 2 3 4 3
# 2 2 3 4 5 2
# 3 3 4 5 6 1
And this seemed to work. However, when the data.frame dimensions are 4*2500, the output data.frame is 1*2501 instead of 4*2501 (all rows are unique, so the frequency is 1).
> df <- as.data.frame(matrix(sample(1:3, 1e4, replace = TRUE), nrow=4))
> o <- aggregate(do.call(paste, df), by=df, length)
> dim(o)
# [1] 1 2501
I tested with smaller data.frames containing only unique rows, and they give the right output (change to nrow=40, for example). However, when the dimensions of the matrix increase, this no longer seems to work, and I just can't figure out what's going wrong. Any ideas?
The issue here is how aggregate.data.frame() determines the groups.
In aggregate.data.frame() there is a loop which forms the grouping variable grp. In that loop, grp is updated via:
grp <- grp * nlevels(ind) + (as.integer(ind) - 1L)
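To see what this update computes, here is a minimal sketch that replays it by hand on the small 6-row data.frame from the question. It assumes grp starts as a vector of zeros and that each grouping column is converted with as.factor() before the update; it is an illustration of the formula, not a copy of the aggregate.data.frame() source.

```r
# Small example data from the question (6 rows, 4 grouping columns).
df <- data.frame(V1 = c(1L, 2L, 1L, 2L, 3L, 1L),
                 V2 = c(2L, 3L, 2L, 3L, 4L, 2L),
                 V3 = c(3L, 4L, 3L, 4L, 5L, 3L),
                 V4 = c(4L, 5L, 4L, 5L, 6L, 4L))

# Assumed starting value: one zero per row.
grp <- rep.int(0, nrow(df))

# Replay the update once per grouping column, last column first.
for (ind in rev(as.list(df))) {
  ind <- as.factor(ind)
  grp <- grp * nlevels(ind) + (as.integer(ind) - 1L)
}

print(grp)
# [1]  0 40  0 40 80  0
```

Each distinct row gets its own number (0, 40 and 80 here), because the update treats the factor codes as digits of a mixed-radix number. The value is multiplied by the number of levels at every step, so it grows quickly with the number of grouping columns.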
The problem with your example is that once by is converted to factors, and the loop has gone over all of these factors, grp ends up being:
Browse[2]> grp
[1] Inf Inf Inf Inf
Essentially, the looping update pushed every value of grp past the largest number a double can represent, so each one overflowed to Inf.
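The overflow can be reproduced outside of aggregate() with a short sketch. As before, the zero starting value and the as.factor() conversion are assumptions made for illustration, and the seed is arbitrary and only there for reproducibility.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df <- as.data.frame(matrix(sample(1:3, 1e4, replace = TRUE), nrow = 4))

grp <- rep.int(0, nrow(df))  # assumed starting value
for (ind in rev(as.list(df))) {
  ind <- as.factor(ind)
  grp <- grp * nlevels(ind) + (as.integer(ind) - 1L)
}

print(grp)                   # expected to be Inf for every row
print(.Machine$double.xmax)  # largest finite double, about 1.8e308
print(length(unique(grp)))   # number of distinct group ids left
```

With 2500 grouping columns, most of which have 2 or 3 levels, the running product is far larger than .Machine$double.xmax, so the four rows can no longer be told apart.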
Having done that, aggregate.data.frame() later does this:
y <- y[match(sort(unique(grp)), grp, 0L), , drop = FALSE]
and this is where the earlier problem manifests itself. The subset in
dim(y[match(sort(unique(grp)), grp, 0L), , drop = FALSE])
keeps only a single row, because
match(sort(unique(grp)), grp, 0L)
returns just 1:
> match(sort(unique(grp)), grp, 0L)
[1] 1
as there is only one unique value of grp.
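The collapse to a single row can be checked in isolation. In this sketch grp is set to Inf by hand, and y is a hypothetical 4-row stand-in for the data.frame of grouping values, not the real intermediate object.

```r
grp <- rep(Inf, 4)  # what grp looks like after the overflow

# Hypothetical stand-in for y: one row per original row.
y <- data.frame(id = 1:4, value = letters[1:4])

# Only one unique value exists, so only its first position is returned.
idx <- match(sort(unique(grp)), grp, 0L)
print(idx)
# [1] 1

print(dim(y[idx, , drop = FALSE]))
# [1] 1 2
```

Only the first row survives the subset, which matches the single-row result reported in the question.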