Efficient way to define a class with multiple, optionally-empty slots in S4 of R?

Tags: oop, r, s4

I am building a package to handle data that arrives with up to 4 different types. Each of these types is a legitimate class in the form of a matrix, data.frame or tree. Depending on the way the data is processed and other experimental factors, some of these data components may be missing, but it is still extremely useful to be able to store this information as an instance of a special class and have methods that recognize the different component data.

Approach 1:

I have experimented with an incremental inheritance structure that looks like a nested tree, where each combination of data types has its own explicitly defined class. This seems difficult to extend for additional data types in the future, and it also forces new developers to learn all the class names, however well-organized those names might be.
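For concreteness, here is a minimal sketch of what such a nested hierarchy could look like with just two components. The class names (ClassWithMatrix, ClassWithMatrixAndDataFrame) and the generic componentNames are hypothetical, chosen only for illustration:

```r
library(methods)

# Hypothetical classes: one class per combination of components.
setClass("ClassWithMatrix", representation(m = "matrix"))
setClass("ClassWithMatrixAndDataFrame", contains = "ClassWithMatrix",
    representation(df = "data.frame"))

# Hypothetical generic that reports which components an object carries.
setGeneric("componentNames", function(object) standardGeneric("componentNames"))
setMethod("componentNames", "ClassWithMatrix", function(object) "matrix")
setMethod("componentNames", "ClassWithMatrixAndDataFrame", function(object) {
    # Reuse the parent method, then add the extra component.
    c(callNextMethod(), "data.frame")
})

full <- new("ClassWithMatrixAndDataFrame",
    m = matrix(1:4, nrow = 2),
    df = data.frame(id = c("a", "b")))
componentNames(full)
```

Method dispatch works without any NULL checks, but every new data type adds another layer of classes, which is the extensibility concern described above.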

Approach 2:

A second approach is to create a single "master-class" that includes a slot for all 4 data types. In order to allow the slots to be NULL for the instances of missing data, it appears necessary to first define a virtual class union between the NULL class and the new data type class, and then use the virtual class union as the expected class for the relevant slot in the master-class. Here is an example (assuming each data type class is already defined):

################################################################################
# Use setClassUnion to define the unholy NULL-data union as a virtual class.
################################################################################    
setClassUnion("dataClass1OrNULL", c("dataClass1", "NULL"))
setClassUnion("dataClass2OrNULL", c("dataClass2", "NULL"))
setClassUnion("dataClass3OrNULL", c("dataClass3", "NULL"))
setClassUnion("dataClass4OrNULL", c("dataClass4", "NULL"))
################################################################################
# Now define the master class with all 4 slots, allowing empty (NULL) slots,
# and give it an explicit prototype so that slots are set to NULL if they
# are not provided at instantiation.
################################################################################
setClass(Class="theMasterClass", 
    representation=representation(
        slot1="dataClass1OrNULL",
        slot2="dataClass2OrNULL",
        slot3="dataClass3OrNULL",
        slot4="dataClass4OrNULL"),
    prototype=prototype(slot1=NULL, slot2=NULL, slot3=NULL, slot4=NULL)
)
################################################################################
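To make the behaviour of Approach 2 concrete, here is a small self-contained sketch that swaps the four undefined data classes for two built-in ones (matrix and data.frame). The names demoMaster, counts and samples are made up for this illustration:

```r
library(methods)

# Stand-ins for the real data classes (assumption: built-in classes suffice here).
setClassUnion("matrixOrNULL", c("matrix", "NULL"))
setClassUnion("dataFrameOrNULL", c("data.frame", "NULL"))

setClass("demoMaster",
    representation = representation(
        counts  = "matrixOrNULL",
        samples = "dataFrameOrNULL"),
    prototype = prototype(counts = NULL, samples = NULL)
)

# Only one component is supplied; the other keeps its NULL prototype.
obj <- new("demoMaster", counts = matrix(1:4, nrow = 2))
has_counts  <- !is.null(obj@counts)
has_samples <- !is.null(obj@samples)

# A value that is neither a matrix nor NULL is still rejected.
bad <- try(new("demoMaster", counts = "not a matrix"), silent = TRUE)
```

Any code that touches a slot has to check is.null() first, which is the cost of this design.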

So the question might be rephrased as:

Are there more efficient and/or flexible alternatives to either of these approaches?

This example is modified from an answer to a SO question about setting the default value of a slot to NULL. This question differs in that I am interested in knowing the best options in R for creating classes with slots that can be empty if needed, despite requiring a specific complex class in all other non-empty cases.

Paul 'Joey' McMurdie asked Nov 05 '22 10:11


1 Answer

In my opinion...

Approach 2

It sort of defeats the purpose to adopt a formal class system, and then to create a class that contains ill-defined slots ('A' or NULL). At a minimum I would try to make DataClass1 have a 'NULL'-like default. As a simple example, the default here is a zero-length numeric vector.

setClass("DataClass1", representation=representation(x="numeric"))
DataClass1 <- function(x=numeric(), ...) {
    new("DataClass1", x=x, ...)
}

Then

setClass("MasterClass1", representation=representation(dataClass1="DataClass1"))
MasterClass1 <- function(dataClass1=DataClass1(), ...) {
    new("MasterClass1", dataClass1=dataClass1, ...)
}

One benefit of this is that methods don't have to test whether the instance in the slot is NULL or 'DataClass1':

setMethod("length", "DataClass1", function(x) length(x@x))
setMethod("length", "MasterClass1", function(x) length(x@dataClass1))

> length(MasterClass1())
[1] 0
> length(MasterClass1(DataClass1(1:5)))
[1] 5

In response to your comment about warning users when they access 'empty' slots, and remembering that users usually want functions to do something rather than tell them they're doing something wrong, I'd probably return the empty object DataClass1() which accurately reflects the state of the object. Maybe a show method would provide an overview that reinforced the status of the slot -- DataClass1: none. This seems particularly appropriate if MasterClass1 represents a way of coordinating several different analyses, of which the user may do only some.
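A show method along those lines might look like the sketch below. The helper slotStatus and the exact wording of the output are my own invention here, not something the methods package provides:

```r
library(methods)

setClass("DataClass1", representation = representation(x = "numeric"))
DataClass1 <- function(x = numeric(), ...) new("DataClass1", x = x, ...)

setClass("MasterClass1", representation = representation(dataClass1 = "DataClass1"))
MasterClass1 <- function(dataClass1 = DataClass1(), ...) {
    new("MasterClass1", dataClass1 = dataClass1, ...)
}

# Hypothetical helper: one-line status text for a DataClass1 instance.
slotStatus <- function(d) {
    n <- length(d@x)
    if (n == 0L) "none" else paste(n, "values")
}

# Summarize the object instead of dumping every slot.
setMethod("show", "MasterClass1", function(object) {
    cat("MasterClass1\n")
    cat("  DataClass1:", slotStatus(object@dataClass1), "\n")
})

MasterClass1()                  # reports the slot as 'none'
MasterClass1(DataClass1(1:5))   # reports how many values are present
```

The user sees at a glance which components are present, and nothing has to warn or stop.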

A limitation of this approach (or your Approach 2) is that you don't get method dispatch on the state of the slot -- you can't write methods that apply only to MasterClass1 instances whose DataClass1 slot has non-zero length, so you are forced to do some sort of manual dispatch (e.g., with if or switch). This might seem like a limitation only for the developer, but it also applies to the user -- the user doesn't get a sense of which operations are uniquely appropriate to instances of MasterClass1 that have non-zero length DataClass1 instances.
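To illustrate what manual dispatch means in practice, here is a sketch with a made-up generic, averageData. Returning NA for the empty case is just one possible choice, picked for this example:

```r
library(methods)

setClass("DataClass1", representation = representation(x = "numeric"))
DataClass1 <- function(x = numeric(), ...) new("DataClass1", x = x, ...)

setClass("MasterClass1", representation = representation(dataClass1 = "DataClass1"))
MasterClass1 <- function(dataClass1 = DataClass1(), ...) {
    new("MasterClass1", dataClass1 = dataClass1, ...)
}

# Hypothetical generic, used only for illustration.
setGeneric("averageData", function(object, ...) standardGeneric("averageData"))

setMethod("averageData", "MasterClass1", function(object, ...) {
    values <- object@dataClass1@x
    # Manual dispatch: S4 sees the same class whether the slot is empty or not,
    # so the method itself has to branch on the slot's state.
    if (length(values) == 0L) {
        NA_real_
    } else {
        mean(values)
    }
})

averageData(MasterClass1())                  # NA
averageData(MasterClass1(DataClass1(1:5)))   # 3
```

Every method that cares about the difference needs a branch like this, whereas with Approach 1 the class of the object would select the right method.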

Approach 1

When you say that the names of the classes in the hierarchy are going to be confusing to your user, it seems like this is maybe pointing to a more fundamental issue -- you're trying too hard to make a comprehensive representation of data types; a user will never be able to keep track of ClassWithMatrixDataFrameAndTree because it doesn't represent the way they view the data. This is maybe an opportunity to scale back your ambitions to really tackle only the most prominent parts of the area you're investigating. Or perhaps an opportunity to re-think how the user might think of and interact with the data they've collected, and to use the separation of interface (what the user sees) from implementation (how you've chosen to represent the data in classes) provided by class systems to more effectively encapsulate what the user is likely to do.

Putting the naming and number of classes aside, when you say "difficult to extend for additional data types in the future" it makes me wonder if perhaps some of the nuances of S4 classes are tripping you up? The short solution is to avoid writing your own initialize methods, and rely on the constructors to do the tricky work, along the lines of

setClass("A", representation(x="numeric"))
setClass("B", representation(y="numeric"), contains="A")

A <- function(x = numeric(), ...) new("A", x=x, ...)
B <- function(a = A(), y = numeric(), ...) new("B", a, y=y, ...)

and then

> B(A(1:5), 10)
An object of class "B"
Slot "y":
[1] 10

Slot "x":
[1] 1 2 3 4 5
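
Extending the hierarchy later then follows the same pattern. As a sketch, a hypothetical third class C (with an invented slot z) can be added without touching A or B:

```r
library(methods)

setClass("A", representation(x = "numeric"))
setClass("B", representation(y = "numeric"), contains = "A")

A <- function(x = numeric(), ...) new("A", x = x, ...)
B <- function(a = A(), y = numeric(), ...) new("B", a, y = y, ...)

# A method written once for the base class.
setMethod("length", "A", function(x) length(x@x))

# Hypothetical later addition: one more layer, same constructor pattern.
# (In a session this constructor name shadows stats::C; pick another name in real code.)
setClass("C", representation(z = "character"), contains = "B")
C <- function(b = B(), z = character(), ...) new("C", b, z = z, ...)

obj <- C(B(A(1:5), 10), "label")
is(obj, "A")   # TRUE: C inherits from B, which inherits from A
length(obj)    # 5: the method defined for A is inherited
```

Because the unnamed argument to new() is an instance of the parent class, its slots are copied into the new object, so no initialize method is needed.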
Martin Morgan answered Nov 09 '22 08:11