
How to use data.table as super class in S4

Tags: r, data.table, s4

In the R package data.table, the manual entry for ?data.table-class says that 'data.table' can be used for inheritance in a class definition, i.e. in the contains argument of a call to setClass:

library("data.table")
setClass("Data.Table", contains = "data.table")

However, I expected to be able to treat an instance of Data.Table like a data.table, and that is not the case. The following snippet results in an error which, as far as I understand, occurs because the [.data.table function cannot handle the mix of S3 and S4 dispatch:

dat <- new("Data.Table", data.table(x = 1))
dat[TRUE]

I solved this by defining a new method for [ that coerces any Data.Table to a data.table before evaluating the call:

setMethod(
  "[", 
  "Data.Table", 
  function(x, i, j, ..., drop = TRUE) {
    mc <- match.call()
    mc$x <- substitute(S3Part(x, strictS3 = TRUE))
    Data.Table(
      eval(mc, envir = parent.frame())
    )
  })
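
To see what the S3Part(x, strictS3 = TRUE) call in the method above does, here is a minimal sketch. It assumes base R's data.frame as the parent class so that it runs without data.table, and the class name Data.Frame is made up for illustration. The same idea should carry over to a class containing data.table, though that is not verified here.

```r
library(methods)

# Hypothetical S4 class containing the S3 class data.frame
# (a stand-in for the Data.Table class in the question)
setClass("Data.Frame", contains = "data.frame")
df4 <- new("Data.Frame", data.frame(x = 1:3))

# strictS3 = TRUE drops the S4 wrapper and returns a plain S3 object
s3 <- S3Part(df4, strictS3 = TRUE)

isS4(df4)   # is the wrapper an S4 object?
isS4(s3)    # is the extracted part an S4 object?
class(s3)   # S3 class of the extracted part
```

Substituting this expression for x in the matched call means that [.data.table only ever sees a plain S3 object.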

And a constructor function, which the method above relies on, to make it more comfortable to use:

Data.Table <- function(...) new("Data.Table", data.table(...))
dat <- Data.Table(x = 1, key = "x")
dat[1]

This is acceptable for some scenarios, but I lose all get and set functions from the data.table package, and I suspect that I have broken some other features. So the question is: how can a working S4 data.table class be implemented? I would appreciate

  1. Pointers to similar attempts/projects
  2. Better/alternative solutions/ideas for an implementation
  3. Any advice on what I lose with respect to performance with the above solution

I found one related question on SO, which presents a similar approach. However, I think it would involve too much coding to be feasible.

asked Aug 25 '15 14:08 by Sebastian


1 Answer

I think the short answer (the problem is still as valid as it was when raised) is that using data.table as a super class in S4 is not advisable, and not possible without a considerable amount of effort and some risk of instability.

It is also not quite clear what the goal was in the case at hand, but let's assume there was no alternative such as forking and modifying the existing data.table package.

Then, to illustrate the case mentioned above with the [, let's first initialize the example:

# replicating some code from above
library("data.table")
Data.Table <- setClass("Data.Table", contains = "data.table")

dat <- Data.Table(data.table(x = 1))
dat[1]
# Error in if (n > 0) c(NA_integer_, -n) else integer() : 
#   argument is of length zero

dat2 <- data.table(x = 1)

Now let's look at [.data.table. It is a lot of code, as you can see in data.table.R on the GitHub repo, so the following reproduces only the relevant part in the simplest dummy way:

# initializing output
ans = vector("list", 1)
# data (just one line of code as we have just one value in our example).
# desired subscript is row 1, but we have just one column as well.
ans[[1]] <- dat[[1]][1]
# add 'names' attribute
setattr(ans, "names", "x")
# set 'class' attribute
setattr(ans, "class", class(dat))
# set 'row.names'
setattr(ans, "row.names", .set_row_names(nrow(ans)))

And there we have the error: setting the row.names fails because dim(ans), and therefore nrow(ans), is NULL.
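
The failing step can be isolated in base R. This is only a sketch of the mechanism, using a plain list in place of the partially built result; the variable names are illustrative.

```r
# A plain list has no dim attribute, so nrow() returns NULL
ans <- list(x = 1)
n <- nrow(ans)

# .set_row_names() evaluates `if (n > 0)`, which fails when n is NULL
msg <- tryCatch(
  .set_row_names(n),
  error = function(e) conditionMessage(e)
)

print(is.null(n))
print(msg)
```

In an English locale the captured message should match the "argument is of length zero" error shown earlier.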

So the real problem is the use of setattr(ans, "class", class(dat)), which does not produce a proper S4 object (try isS4(ans) or print(ans) just afterwards). In fact, ?class says the following about S4:

The replacement version of the function sets the class to the value provided. For classes that have a formal definition, directly replacing the class this way is strongly deprecated. The expression as(object, value) is the way to coerce an object to a particular class.

data.table's setattr, which uses R's setAttrib function at the C level, is similar to calling attr(ans, "class") <- "Data.Table" or class(ans) <- "Data.Table", both of which would break in the same way.
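
A small sketch of the difference between setting the class attribute by hand and constructing a real instance. It assumes base R's data.frame as the parent class so that it runs without data.table, and the class name Data.Frame is hypothetical.

```r
library(methods)

# Hypothetical S4 class containing the S3 class data.frame
setClass("Data.Frame", contains = "data.frame")

# Setting the class attribute by hand only attaches a label to the list;
# no S4 instance is constructed
fake <- list(x = 1)
attr(fake, "class") <- "Data.Frame"
isS4(fake)

# Going through new() builds a real instance of the class
real <- new("Data.Frame", data.frame(x = 1))
isS4(real)
is(real, "data.frame")
```

The hand-labelled list carries the class name but none of the structure that S4 methods expect, which is consistent with the warning quoted from ?class.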

If you do setattr(ans, "class", class(dat2)) instead, you will see that everything is fine here, as should be with S3. One more word of caution though:

setattr(ans, "class", "data.frame")

and then print(ans) or dim(ans) may not look very nice to you... (although ans$x is ok).


Overriding setattr() cleanly isn't trivial either, and such an approach will probably not get you any further than the one you have outlined above. The result could be something like:

setattr_new <- function(x, name, value) {
  if (name == "class" && "Data.Table" %in% value) {
    value <- c("data.table", "data.frame")
  }
  if (name == "names" && is.data.table(x) && length(attr(x, "names")) && !is.null(value))
    setnames(x, value)
  else {
    ans = .Call(Csetattrib, x, name, value)
    if (!is.null(ans)) {
      warning("Input is a length=1 logical that points to the same address as R's global TRUE value. Therefore the attribute has not been set by reference, rather on a copy. You will need to assign the result back to a variable. See https://github.com/Rdatatable/data.table/issues/1281 for more.")
      x = ans
    }
  }
  if (name == "levels" && is.factor(x) && anyDuplicated(value)) 
    .Call(Csetlevels, x, (value <- as.character(value)), unique(value))
  invisible(x)
}

godmode:::assignAnywhere("setattr", setattr_new)

identical(dat[1], dat2[1])
# [1] TRUE

# then possibly convert back to S4 class if desired for further processing at the end
as(dat[1], "Data.Table")

answered Oct 16 '22 05:10 by RolandASc