data.table and table unexpected behavior

Tags:

data.table

The data comes from another question I was playing around with:

library(data.table)

dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
                 country=c(rep(1,4),rep(2,6)),
                 event=1:10, key="user")
#    user country event
#1:     3       1     1
#2:     3       1     2
#3:     3       1     3
#4:     3       1     4
#5:     3       2     5
#6:     4       2     6
#7:     4       2     7
#8:     4       2     8
#9:     4       2     9
#10:    4       2    10

And here's the surprising behavior:

dt[user == 3, as.data.frame(table(country))]
#  country Freq
#1       1    4
#2       2    1

dt[user == 4, as.data.frame(table(country))]
#  country Freq
#1       2    5

dt[, as.data.frame(table(country)), by = user]
#   user country Freq
#1:    3       1    4
#2:    3       2    1
#3:    4       1    5
#             ^^^ - why is this 1 instead of 2?!

Thanks mnel and Victor K. The natural follow-up: shouldn't it be 2, i.e. is this a bug? I expected

dt[, blah, by = user]

to return a result identical to

rbind(dt[user == 3, blah], dt[user == 4, blah])

Is that expectation incorrect?
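For comparison, the expectation can be spelled out in base R. This is only an illustrative sketch, not data.table code: it assumes a plain data.frame copy of the data (df and pieces are names not in the original) and binds the per-user tables with rbind, which merges factor levels instead of reusing the integer codes.

```r
# Plain data.frame copy of the example data (df is an illustrative name).
df <- data.frame(user    = c(rep(3, 5), rep(4, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 event   = 1:10)

# One frequency table per user, each with its own factor levels.
pieces <- lapply(split(df, df$user),
                 function(d) as.data.frame(table(country = d$country)))

# rbind on data frames expands the factor levels of the pieces as needed.
res <- do.call(rbind, pieces)

as.character(res$country)  # "1" "2" "2"
res$Freq                   # 4 1 5
```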

asked Apr 24 '13 20:04

eddi

2 Answers

The idiomatic data.table approach is to use .N:

dt[, .N, by = list(user, country)]

This will be far quicker, and it keeps country in its original class instead of converting it to a factor.
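A minimal sketch of that approach on the question's data, assuming the data.table package is installed; counts is an illustrative name that does not appear in the original.

```r
library(data.table)

# Same data as in the question.
dt <- data.table(user    = c(rep(3, 5), rep(4, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 event   = 1:10, key = "user")

# .N is the number of rows in each (user, country) group.
counts <- dt[, .N, by = list(user, country)]
print(counts)

# country is still numeric here, not a factor.
class(dt$country)
class(counts$country)
```

Because no factor is ever created, there are no integer codes to be mistaken for values.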

answered Oct 06 '22 00:10

mnel

As mnel noted in the comments, as.data.frame(table(...)) produces a data frame whose first variable is a factor. For user == 4, that factor has only one level, "2", which is stored internally as the integer code 1.

What you want are the factor levels, but what you get is how factors are stored internally (as integer codes, starting from 1). Converting every column to character (Freq included) gives the expected result:

> dt[, lapply(as.data.frame(table(country)), as.character), by = user]
   user country Freq
1:    3       1    4
2:    3       2    1
3:    4       2    5
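The difference between a factor's levels and its internal codes can be seen in isolation with base R. This is a small sketch for the user == 4 group only; the vector country and the name tab are set up here for illustration.

```r
# The country values seen by the user == 4 group.
country <- c(2, 2, 2, 2, 2)

tab <- as.data.frame(table(country))

class(tab$country)         # "factor"
levels(tab$country)        # "2"  (the label you want)
as.integer(tab$country)    # 1    (the internal code)
as.character(tab$country)  # "2"

# To get back a number, go through the label, not the code.
as.numeric(as.character(tab$country))  # 2
```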

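Update. Regarding your second question: no, I think the data.table behaviour is correct. The same thing happens in plain R when you concatenate two factors with different levels (the output below is from R before 4.1.0; since 4.1.0, c() on factors combines the levels and returns a factor):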

> a <- factor(3:5)
> b <- factor(6:8)
> a
[1] 3 4 5
Levels: 3 4 5
> b
[1] 6 7 8
Levels: 6 7 8
> c(a,b)
[1] 1 2 3 1 2 3
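If the goal is to concatenate by label rather than by code, one option that does not depend on the R version is to go through character vectors. A minimal sketch, where codes and combined are illustrative names:

```r
a <- factor(3:5)
b <- factor(6:8)

# Concatenating the internal codes loses the labels.
codes <- c(as.integer(a), as.integer(b))
codes                   # 1 2 3 1 2 3

# Concatenating the labels and rebuilding the factor keeps them.
combined <- factor(c(as.character(a), as.character(b)))
as.character(combined)  # "3" "4" "5" "6" "7" "8"
levels(combined)        # "3" "4" "5" "6" "7" "8"
```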

answered Oct 05 '22 23:10

Victor K.
