I have this problem with <code>data.table</code> which is driving me crazy recently. It looks like a bug but may be I'm missing something obvious here. I have the following data frame: <pre class="prettyprint"><code># First some data data <- data.table(structure(list( month = structure(c(1356998400, 1356998400, 1356998400, 1359676800, 1354320000, 1359676800, 1359676800, 1356998400, 1356998400, 1354320000, 1354320000, 1354320000, 1359676800, 1359676800, 1359676800, 1356998400, 1359676800, 1359676800, 1356998400, 1359676800, 1359676800, 1359676800, 1359676800, 1354320000, 1354320000), class = c("POSIXct", "POSIXt"), tzone = "UTC"), portal = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE ), satisfaction = c(10L, 10L, 10L, 9L, 10L, 10L, 9L, 10L, 10L, 9L, 2L, 8L, 10L, 9L, 10L, 10L, 9L, 10L, 10L, 10L, 9L, 10L, 9L, 10L, 10L)), .Names = c("month", "portal", "satisfaction"), row.names = c(NA, -25L), class = "data.frame")) </code></pre> I want to summarize it by both <code>portal</code> and <code>month</code>. Summarizing with good old <code>tapply</code> works as expected - I get 3x2 matrix with results for Dec-2012 and Jan-Feb 2013: <pre class="prettyprint"><code>> tapply(data$satisfaction, list(data$month, data$portal), mean) FALSE TRUE 2012-12-01 8.5 8.000000 2013-01-01 10.0 10.000000 2013-02-01 9.0 9.545455 </code></pre> Summarizing with <code>by</code> argument of <code>data.table</code> does not: <pre class="prettyprint"><code>> data[, mean(satisfaction), by = 'month,portal'] month portal V1 1: 2013-01-01 FALSE 10.000000 2: 2013-02-01 TRUE 9.000000 3: 2013-01-01 TRUE 10.000000 4: 2012-12-01 FALSE 8.500000 5: 2012-12-01 TRUE 7.333333 6: 2013-02-01 TRUE 9.666667 7: 2013-02-01 FALSE 9.000000 8: 2012-12-01 TRUE 10.000000 </code></pre> As you see, it returns a data table with 8 values, not 6 as expected; values where <code>portal == TRUE</code> and <code>month == 2012-02-01</code> are duplicated, for example. Interestingly enough, if I limit this just to 2013's data, everything works as expected: <pre class="prettyprint"><code>> data[month >= ymd(20130101), mean(satisfaction), by = 'month,portal'] month portal V1 1: 2013-01-01 TRUE 10.000000 2: 2013-01-01 FALSE 10.000000 3: 2013-02-01 TRUE 9.545455 4: 2013-02-01 FALSE 9.000000 </code></pre> I am puzzled beyond believe :). Could somebody please help me?

That is a known issue that was solved in data.table 1.8.7 (not in CRAN yet as of this writing). From data.table NEWS: <blockquote> <pre class="prettyprint"><code>BUG FIXES <...> o setkey could sort 'double' columns (such as POSIXct) incorrectly when not the last column of the key, #2484. In data.table's C code : x[a] > x[b]-tol should have been : x[a]-x[b] > -tol [or x[b]-x[a] < tol ] The difference may have been machine/compiler dependent. Many thanks to statquant for the short reproducible example. Test added. </code></pre> </blockquote> After updating to 1.8.7 with <code>install.packages("data.table", repos="http://R-Forge.R-project.org")</code>, everything works as expected.

data.table NOT summarizing properly by two columns

Tags:

r

data.table

I have this problem with data.table which is driving me crazy recently. It looks like a bug but may be I'm missing something obvious here.

I have the following data frame:

# First some data
data <- data.table(structure(list(
  month = structure(c(1356998400, 1356998400, 1356998400, 
                      1359676800, 1354320000, 1359676800, 1359676800, 1356998400, 1356998400, 
                      1354320000, 1354320000, 1354320000, 1359676800, 1359676800, 1359676800, 
                      1356998400, 1359676800, 1359676800, 1356998400, 1359676800, 1359676800, 
                      1359676800, 1359676800, 1354320000, 1354320000), class = c("POSIXct", 
                                                                                 "POSIXt"), tzone = "UTC"), 
  portal = c(TRUE, TRUE, FALSE, TRUE, 
             TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, 
             TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE
  ), 
  satisfaction = c(10L, 10L, 10L, 9L, 10L, 10L, 9L, 10L, 10L, 
                   9L, 2L, 8L, 10L, 9L, 10L, 10L, 9L, 10L, 10L, 10L, 9L, 10L, 9L, 
                   10L, 10L)), 
                  .Names = c("month", "portal", "satisfaction"), 
                  row.names = c(NA, -25L), class = "data.frame"))

I want to summarize it by both portal and month. Summarizing with good old tapply works as expected - I get 3x2 matrix with results for Dec-2012 and Jan-Feb 2013:

> tapply(data$satisfaction, list(data$month, data$portal), mean)
           FALSE      TRUE
2012-12-01   8.5  8.000000
2013-01-01  10.0 10.000000
2013-02-01   9.0  9.545455

Summarizing with by argument of data.table does not:

> data[, mean(satisfaction), by = 'month,portal']
   month      portal        V1
1: 2013-01-01  FALSE 10.000000
2: 2013-02-01   TRUE  9.000000
3: 2013-01-01   TRUE 10.000000
4: 2012-12-01  FALSE  8.500000
5: 2012-12-01   TRUE  7.333333
6: 2013-02-01   TRUE  9.666667
7: 2013-02-01  FALSE  9.000000
8: 2012-12-01   TRUE 10.000000

As you see, it returns a data table with 8 values, not 6 as expected; values where portal == TRUE and month == 2012-02-01 are duplicated, for example.

Interestingly enough, if I limit this just to 2013's data, everything works as expected:

> data[month >= ymd(20130101), mean(satisfaction), by = 'month,portal']
        month portal        V1
1: 2013-01-01   TRUE 10.000000
2: 2013-01-01  FALSE 10.000000
3: 2013-02-01   TRUE  9.545455
4: 2013-02-01  FALSE  9.000000

I am puzzled beyond believe :). Could somebody please help me?

493

asked Feb 25 '13 21:02

Victor K.

2 Answers

That is a known issue that was solved in data.table 1.8.7 (not in CRAN yet as of this writing).

From data.table NEWS:

BUG FIXES

    <...>

o   setkey could sort 'double' columns (such as POSIXct) incorrectly when not the
    last column of the key, #2484. In data.table's C code :
        x[a] > x[b]-tol
    should have been :
        x[a]-x[b] > -tol  [or  x[b]-x[a] < tol ]
    The difference may have been machine/compiler dependent. Many thanks to statquant
    for the short reproducible example. Test added.

After updating to 1.8.7 with install.packages("data.table", repos="http://R-Forge.R-project.org"), everything works as expected.

answered Sep 23 '22 04:09

Victor K.

The problem seems to be with the sorting. When I load data and do setkey:

setkey(data, "month", "portal")

# > data
#          month portal satisfaction
#  1: 2012-12-01   TRUE           10
#  2: 2012-12-01  FALSE            9
#  3: 2012-12-01  FALSE            8
#  4: 2012-12-01   TRUE            2
#  5: 2012-12-01   TRUE           10
#  6: 2012-12-01   TRUE           10
#  7: 2013-01-01   TRUE           10
#  8: 2013-01-01   TRUE           10
#  9: 2013-01-01   TRUE           10
# 10: 2013-01-01   TRUE           10
# 11: 2013-01-01   TRUE           10
# 12: 2013-01-01   TRUE           10
# 13: 2013-01-01  FALSE           10
# 14: 2013-02-01   TRUE            9
# 15: 2013-02-01   TRUE            9
# 16: 2013-02-01  FALSE            9
# 17: 2013-02-01   TRUE           10
# 18: 2013-02-01   TRUE           10
# 19: 2013-02-01   TRUE           10
# 20: 2013-02-01   TRUE           10
# 21: 2013-02-01   TRUE           10
# 22: 2013-02-01   TRUE            9
# 23: 2013-02-01   TRUE           10
# 24: 2013-02-01   TRUE            9
# 25: 2013-02-01   TRUE            9
#          month portal satisfaction

You see that the portal column is not properly sorted. When I do setkey again,

setkey(data, "month", "portal")

# I get this warning message:
Warning message:
In setkeyv(x, cols, verbose = verbose) :
  Already keyed by this key but had invalid row order, key rebuilt. 
  If you didn't go under the hood please let datatable-help know so 
  the root cause can be fixed.

Now, the data columns seem properly sorted by key columns:

# > data
#          month portal satisfaction
#  1: 2012-12-01  FALSE            9
#  2: 2012-12-01  FALSE            8
#  3: 2012-12-01   TRUE           10
#  4: 2012-12-01   TRUE            2
#  5: 2012-12-01   TRUE           10
#  6: 2012-12-01   TRUE           10
#  7: 2013-01-01  FALSE           10
#  8: 2013-01-01   TRUE           10
#  9: 2013-01-01   TRUE           10
# 10: 2013-01-01   TRUE           10
# 11: 2013-01-01   TRUE           10
# 12: 2013-01-01   TRUE           10
# 13: 2013-01-01   TRUE           10
# 14: 2013-02-01  FALSE            9
# 15: 2013-02-01   TRUE            9
# 16: 2013-02-01   TRUE            9
# 17: 2013-02-01   TRUE           10
# 18: 2013-02-01   TRUE           10
# 19: 2013-02-01   TRUE           10
# 20: 2013-02-01   TRUE           10
# 21: 2013-02-01   TRUE           10
# 22: 2013-02-01   TRUE            9
# 23: 2013-02-01   TRUE           10
# 24: 2013-02-01   TRUE            9
# 25: 2013-02-01   TRUE            9
#          month portal satisfaction

So, it seems to be an issue with sortting a POSIXct + logical types?

data[, mean(satisfaction), by=list(month, portal)]

#         month portal        V1
# 1: 2012-12-01  FALSE  8.500000
# 2: 2012-12-01   TRUE  8.000000
# 3: 2013-01-01  FALSE 10.000000
# 4: 2013-01-01   TRUE 10.000000
# 5: 2013-02-01  FALSE  9.000000
# 6: 2013-02-01   TRUE  9.545455

I therefore think there's a bug.

answered Sep 21 '22 04:09

Arun

Related questions
                            
                                How does the subset argument work in the lm() function?
                            
                                How to set the "zero" oritentation in ggplot2 of R?
                            
                                ggplot2: highlight chart area
                            
                                How to elegantly convert datetime from decimal to "%d.%m.%y %H:%M:%S"?
                            
                                Create "arty" mosaic pictures with R (*not* statistical mosaic plots)
                            
                                How to automatically shrink down row numbers in R data frame when removing rows in R
                            
                                How to mine for motifs in R with iGraph
                            
                                Plotting neural network model from nnet package R cran
                            
                                How to Increase the thickness of the box lines in an R boxplot?
                            
                                Count how many times the element is repeated in a sequence (in R)
                            
                                Pairwise interaction matrix in R
                            
                                Add greek letters to axis tick labels in R
                            
                                Reshaping data to plot in R using ggplot2
                            
                                How to calculate difference from initial value for each group in R?
                            
                                R style data-axis buffer in matplotlib
                            
                                How to perform single factor ANOVA in R with samples organized by column?
                            
                                fast join data.table (potential bug, checking before reporting)
                            
                                R CMD BATCH - output in terminal
                            
                                Sampling from a Binomial(K, p) with unexpected result
                            
                                How to perform a vector overlay of two SpatialPolygonsDataFrame objects?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

data.table NOT summarizing properly by two columns

Tags:

r

data.table

Victor K.

People also ask

2 Answers

Victor K.

Arun

Recent Activity

Donate For Us