I have this problem with data.table
which is driving me crazy recently. It looks like a bug but may be I'm missing something obvious here.
I have the following data frame:
# First some data
data <- data.table(structure(list(
month = structure(c(1356998400, 1356998400, 1356998400,
1359676800, 1354320000, 1359676800, 1359676800, 1356998400, 1356998400,
1354320000, 1354320000, 1354320000, 1359676800, 1359676800, 1359676800,
1356998400, 1359676800, 1359676800, 1356998400, 1359676800, 1359676800,
1359676800, 1359676800, 1354320000, 1354320000), class = c("POSIXct",
"POSIXt"), tzone = "UTC"),
portal = c(TRUE, TRUE, FALSE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE
),
satisfaction = c(10L, 10L, 10L, 9L, 10L, 10L, 9L, 10L, 10L,
9L, 2L, 8L, 10L, 9L, 10L, 10L, 9L, 10L, 10L, 10L, 9L, 10L, 9L,
10L, 10L)),
.Names = c("month", "portal", "satisfaction"),
row.names = c(NA, -25L), class = "data.frame"))
I want to summarize it by both portal
and month
. Summarizing with good old tapply
works as expected - I get 3x2 matrix with results for Dec-2012 and Jan-Feb 2013:
> tapply(data$satisfaction, list(data$month, data$portal), mean)
FALSE TRUE
2012-12-01 8.5 8.000000
2013-01-01 10.0 10.000000
2013-02-01 9.0 9.545455
Summarizing with by
argument of data.table
does not:
> data[, mean(satisfaction), by = 'month,portal']
month portal V1
1: 2013-01-01 FALSE 10.000000
2: 2013-02-01 TRUE 9.000000
3: 2013-01-01 TRUE 10.000000
4: 2012-12-01 FALSE 8.500000
5: 2012-12-01 TRUE 7.333333
6: 2013-02-01 TRUE 9.666667
7: 2013-02-01 FALSE 9.000000
8: 2012-12-01 TRUE 10.000000
As you see, it returns a data table with 8 values, not 6 as expected; values where portal == TRUE
and month == 2012-02-01
are duplicated, for example.
Interestingly enough, if I limit this just to 2013's data, everything works as expected:
> data[month >= ymd(20130101), mean(satisfaction), by = 'month,portal']
month portal V1
1: 2013-01-01 TRUE 10.000000
2: 2013-01-01 FALSE 10.000000
3: 2013-02-01 TRUE 9.545455
4: 2013-02-01 FALSE 9.000000
I am puzzled beyond believe :). Could somebody please help me?
The easiest way to create summary tables in R is to use the describe() and describeBy() functions from the psych library. The following examples show how to use these functions in practice.
That is a known issue that was solved in data.table 1.8.7 (not in CRAN yet as of this writing).
From data.table NEWS:
BUG FIXES <...> o setkey could sort 'double' columns (such as POSIXct) incorrectly when not the last column of the key, #2484. In data.table's C code : x[a] > x[b]-tol should have been : x[a]-x[b] > -tol [or x[b]-x[a] < tol ] The difference may have been machine/compiler dependent. Many thanks to statquant for the short reproducible example. Test added.
After updating to 1.8.7 with install.packages("data.table", repos="http://R-Forge.R-project.org")
, everything works as expected.
The problem seems to be with the sorting. When I load data
and do setkey
:
setkey(data, "month", "portal")
# > data
# month portal satisfaction
# 1: 2012-12-01 TRUE 10
# 2: 2012-12-01 FALSE 9
# 3: 2012-12-01 FALSE 8
# 4: 2012-12-01 TRUE 2
# 5: 2012-12-01 TRUE 10
# 6: 2012-12-01 TRUE 10
# 7: 2013-01-01 TRUE 10
# 8: 2013-01-01 TRUE 10
# 9: 2013-01-01 TRUE 10
# 10: 2013-01-01 TRUE 10
# 11: 2013-01-01 TRUE 10
# 12: 2013-01-01 TRUE 10
# 13: 2013-01-01 FALSE 10
# 14: 2013-02-01 TRUE 9
# 15: 2013-02-01 TRUE 9
# 16: 2013-02-01 FALSE 9
# 17: 2013-02-01 TRUE 10
# 18: 2013-02-01 TRUE 10
# 19: 2013-02-01 TRUE 10
# 20: 2013-02-01 TRUE 10
# 21: 2013-02-01 TRUE 10
# 22: 2013-02-01 TRUE 9
# 23: 2013-02-01 TRUE 10
# 24: 2013-02-01 TRUE 9
# 25: 2013-02-01 TRUE 9
# month portal satisfaction
You see that the portal
column is not properly sorted. When I do setkey
again,
setkey(data, "month", "portal")
# I get this warning message:
Warning message:
In setkeyv(x, cols, verbose = verbose) :
Already keyed by this key but had invalid row order, key rebuilt.
If you didn't go under the hood please let datatable-help know so
the root cause can be fixed.
Now, the data
columns seem properly sorted by key columns:
# > data
# month portal satisfaction
# 1: 2012-12-01 FALSE 9
# 2: 2012-12-01 FALSE 8
# 3: 2012-12-01 TRUE 10
# 4: 2012-12-01 TRUE 2
# 5: 2012-12-01 TRUE 10
# 6: 2012-12-01 TRUE 10
# 7: 2013-01-01 FALSE 10
# 8: 2013-01-01 TRUE 10
# 9: 2013-01-01 TRUE 10
# 10: 2013-01-01 TRUE 10
# 11: 2013-01-01 TRUE 10
# 12: 2013-01-01 TRUE 10
# 13: 2013-01-01 TRUE 10
# 14: 2013-02-01 FALSE 9
# 15: 2013-02-01 TRUE 9
# 16: 2013-02-01 TRUE 9
# 17: 2013-02-01 TRUE 10
# 18: 2013-02-01 TRUE 10
# 19: 2013-02-01 TRUE 10
# 20: 2013-02-01 TRUE 10
# 21: 2013-02-01 TRUE 10
# 22: 2013-02-01 TRUE 9
# 23: 2013-02-01 TRUE 10
# 24: 2013-02-01 TRUE 9
# 25: 2013-02-01 TRUE 9
# month portal satisfaction
So, it seems to be an issue with sortting a POSIXct + logical
types?
data[, mean(satisfaction), by=list(month, portal)]
# month portal V1
# 1: 2012-12-01 FALSE 8.500000
# 2: 2012-12-01 TRUE 8.000000
# 3: 2013-01-01 FALSE 10.000000
# 4: 2013-01-01 TRUE 10.000000
# 5: 2013-02-01 FALSE 9.000000
# 6: 2013-02-01 TRUE 9.545455
I therefore think there's a bug.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With