The data comes from another question I was playing around with:
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
country=c(rep(1,4),rep(2,6)),
event=1:10, key="user")
# user country event
#1: 3 1 1
#2: 3 1 2
#3: 3 1 3
#4: 3 1 4
#5: 3 2 5
#6: 4 2 6
#7: 4 2 7
#8: 4 2 8
#9: 4 2 9
#10: 4 2 10
And here's the surprising behavior:
dt[user == 3, as.data.frame(table(country))]
# country Freq
#1 1 4
#2 2 1
dt[user == 4, as.data.frame(table(country))]
# country Freq
#1 2 5
dt[, as.data.frame(table(country)), by = user]
# user country Freq
#1: 3 1 4
#2: 3 2 1
#3: 4 1 5
# ^^^ - why is this 1 instead of 2?!
Thanks mnel and Victor K. The natural follow-up is - shouldn't it be 2, i.e. is this a bug? I expected
dt[, blah, by = user]
to return identical result to
rbind(dt[user == 3, blah], dt[user == 4, blah])
Is that expectation incorrect?
A data table is one type of graphic organizer used frequently in science. It is used especially during laboratory experiments when qualitative and/or quantitative data are collected. Data tables are not randomly constructed; they have at least two columns or rows and specific data entered into each column/row.
A data table is a document composed of columns, rows and cells that contain specific values. They store information that people can retrieve later and update as needed. The data table title, column headers and row headers can help a user understand the information in the table more clearly.
The idiomatic data.table approach is to use .N
dt[ , .N, by = list(user, country)]
This will be far quicker and it will also retain country as the same class as in the original.
As mnel
noted in comments, as.data.frame(table(...))
produces a data frame where the first variable is a factor. For user == 4
, there is only one level in the factor, which is stored internally as 1.
What you want is factor levels, but what you get is how factors are stored internally (as integers, starting from 1). The following provides the expected result:
> dt[, lapply(as.data.frame(table(country)), as.character), by = user]
user country Freq
1: 3 1 4
2: 3 2 1
3: 4 2 5
Update. Regarding your second question: no, I think data.table
behaviour is correct. Same thing happens in plain R when you join two factors with different levels:
> a <- factor(3:5)
> b <- factor(6:8)
> a
[1] 3 4 5
Levels: 3 4 5
> b
[1] 6 7 8
Levels: 6 7 8
> c(a,b)
[1] 1 2 3 1 2 3
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With