
data.table and table unexpected behavior

Tags: r, data.table

The data comes from another question I was playing around with:

library(data.table)

dt <- data.table(user = c(rep(3, 5), rep(4, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 event = 1:10, key = "user")
#    user country event
#1:     3       1     1
#2:     3       1     2
#3:     3       1     3
#4:     3       1     4
#5:     3       2     5
#6:     4       2     6
#7:     4       2     7
#8:     4       2     8
#9:     4       2     9
#10:    4       2    10

And here's the surprising behavior:

dt[user == 3, as.data.frame(table(country))]
#  country Freq
#1       1    4
#2       2    1

dt[user == 4, as.data.frame(table(country))]
#  country Freq
#1       2    5

dt[, as.data.frame(table(country)), by = user]
#   user country Freq
#1:    3       1    4
#2:    3       2    1
#3:    4       1    5
#             ^^^ - why is this 1 instead of 2?!

Thanks mnel and Victor K. The natural follow-up is: shouldn't it be 2, i.e. is this a bug? I expected

dt[, blah, by = user]

to return a result identical to

rbind(dt[user == 3, blah], dt[user == 4, blah])

Is that expectation incorrect?

asked Apr 24 '13 20:04 by eddi


2 Answers

The idiomatic data.table approach is to use .N:

 dt[ , .N, by = list(user, country)]

This will be far quicker, and it keeps country in its original class (numeric here) instead of converting it to a factor.
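
As a rough illustration of the point above, this sketch rebuilds the dt from the question and counts rows per user and country with .N. The variable name counts is only for this example.

```r
library(data.table)

# Same data as in the question
dt <- data.table(user = c(rep(3, 5), rep(4, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 event = 1:10, key = "user")

# .N is the number of rows in each (user, country) group
counts <- dt[, .N, by = list(user, country)]
print(counts)

# country is still numeric; no factor is created along the way
print(class(counts$country))
```

Because no factor is involved, there are no internal integer codes that could leak into the result.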

answered Oct 06 '22 00:10 by mnel


As mnel noted in comments, as.data.frame(table(...)) produces a data frame whose first variable is a factor. For user == 4, that factor has only one level ("2"), and its single value is stored internally as the integer code 1.

What you want is the factor levels, but what you get is the internal representation of the factor (integer codes, starting from 1). Converting the columns to character first gives the expected result:

> dt[, lapply(as.data.frame(table(country)), as.character), by = user]
   user country Freq
1:    3       1    4
2:    3       2    1
3:    4       2    5
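
To make the level/code distinction concrete, here is a small base R sketch that mimics what table(country) yields for the user == 4 group. The name tab is only for illustration.

```r
# Five rows, all with country 2, as in the user == 4 group
tab <- as.data.frame(table(country = c(2, 2, 2, 2, 2)))

print(class(tab$country))         # the first column is a factor
print(levels(tab$country))        # its only level is the label "2"
print(as.integer(tab$country))    # the internal code is 1
print(as.character(tab$country))  # converting via character recovers "2"
```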

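Update. Regarding your second question: no, I think data.table's behaviour is correct. The same thing happens in plain R when you combine two factors with different levels. (The output below is from R before 4.1.0; since 4.1.0, c() on factors returns a factor with the combined levels.)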

> a <- factor(3:5)
> b <- factor(6:8)
> a
[1] 3 4 5
Levels: 3 4 5
> b
[1] 6 7 8
Levels: 6 7 8
> c(a,b)
[1] 1 2 3 1 2 3
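
If the labels are what you need, one option (a minimal sketch, not the only way) is to combine the factors through their character values and rebuild the factor. The name ab is just for this example.

```r
a <- factor(3:5)
b <- factor(6:8)

# Combine by label, then rebuild a factor over the union of the labels
ab <- factor(c(as.character(a), as.character(b)))
print(ab)
print(as.character(ab))
```

This goes through the labels explicitly, so it does not depend on how c() treats factors in a given R version.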

answered Oct 05 '22 23:10 by Victor K.