In R data table, max works on an ordered factor and fails when grouped

Tags: r, data.table

The function max() works correctly on a column of type ordered factor. However, the same operation fails when the call is grouped with by=.

Let's say I have a data.table as:

library(data.table)
DT <- data.table(ID=rep(1:3, 3), State=sample(LETTERS[1:3], 9, replace=TRUE))

Convert the column State to an ordered factor:

DT[, State := factor(State, levels=LETTERS[1:3], ordered = TRUE)]

This works:

DT[, max(State)]

This fails with error:

DT[, max(State), by="ID"]

Error is: Error in gmax(State) : max is not meaningful for factors.

How come?

Asked by Sun Bee, Jun 18 '18 19:06


1 Answer

This was a bug in data.table, and it has been fixed in the development version (1.11.5 at the time of writing).
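For context, base R makes the distinction that the error message glosses over: max() is defined for ordered factors but not for unordered ones. The sketch below shows that difference outside data.table, using two made-up vectors (ordered_state and plain_state are illustrative names, not from the question). The grouped code path appears to have treated the ordered column like the unordered case.

```r
# Illustrative vectors, not taken from the question
ordered_state <- factor(c("A", "C", "B"), levels = c("A", "B", "C"), ordered = TRUE)
plain_state   <- factor(c("A", "C", "B"), levels = c("A", "B", "C"))

# Ordered factor: base max() returns the highest level present
ordered_max <- max(ordered_state)
print(ordered_max)

# Unordered factor: base max() signals an error, captured here as a message
plain_msg <- tryCatch(
  max(plain_state),
  error = function(e) conditionMessage(e)
)
print(plain_msg)
```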

You can install the development version via:

install.packages('data.table', type = 'source',
                 repos = 'http://Rdatatable.github.io/data.table')

If this fails, check full details on the Installation wiki.

library(data.table)
# data.table 1.11.5 IN DEVELOPMENT built 2018-08-13 20:20:11 UTC; travis  Latest news: r-datatable.com
DT[ , max(State), by="ID"]
#    ID V1
# 1:  1  C
# 2:  2  C
# 3:  3  B

For those in controlled/production environments unable to update, you can still sidestep the problem by running:

# options() returns the previous settings as a list
dt_optim = options(datatable.optimize = 0)
DT[ , max(State), by="ID"]
# resetting afterwards to keep your code running as fast as possible
options(dt_optim)

The bug came from GForce, data.table's internally optimized grouping framework, which swaps max() for its own gmax() when grouping. Setting datatable.optimize to 0 turns that optimization off, so the grouped call falls back to base::max.
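If you need the workaround in several places, one option is to wrap it so the setting is always restored, even when the query errors. This is only a sketch: without_gforce is a hypothetical helper name, not part of data.table, and the State values are fixed by hand (instead of sample()) purely so the example is reproducible.

```r
library(data.table)

# Hand-picked values so the example is reproducible (the question used sample())
DT <- data.table(
  ID    = rep(1:3, 3),
  State = c("A", "B", "A", "C", "A", "B", "B", "C", "A")
)
DT[, State := factor(State, levels = LETTERS[1:3], ordered = TRUE)]

# Hypothetical helper (not part of data.table): evaluates `expr` with
# GForce optimization switched off and restores the previous option on exit.
without_gforce <- function(expr) {
  old <- options(datatable.optimize = 0)  # returns the old settings as a list
  on.exit(options(old), add = TRUE)       # restored even if expr fails
  force(expr)                             # the lazy argument is evaluated here
}

res <- without_gforce(DT[, max(State), by = "ID"])
print(res)
```

Because R evaluates arguments lazily, the query only runs once force(expr) is reached, which is after the option has been changed.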

Answered by MichaelChirico, Oct 29 '22 12:10