Multiple formulae with shared parameters in R

Tags: r, formulas

We're trying to come up with a way for an R function to handle a model which has multiple responses, multiple explanatory variables, and possibly shared parameters between the responses. For example:

Y1 ~ X1 + X2 + X3
Y2 ~ X3 + X4

specifies two responses and four explanatory variables. X3 appears in both, and we want the user to control whether the associated parameter value is the same or different, i.e.:

Y1 = b1 X1 + b2 X2 + b3 X3
Y2 = b3 X3 + b4 X4

which is a model with four 'b' parameters, or

Y1 = b1 X1 + b2 X2 + b3 X3
Y2 = b4 X3 + b5 X4

a model with five parameters.
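
To make the two parameterisations concrete, here is a minimal sketch of how each could be fitted by least squares on the stacked responses. The simulated data, the coefficient values, and the use of lm.fit() are assumptions for illustration only, as is the implied equal error variance for Y1 and Y2; there are no intercepts, matching the equations above.

```r
# Simulated, noise-free data; names and true values are purely illustrative.
set.seed(1)
n  <- 50
X1 <- rnorm(n); X2 <- rnorm(n); X3 <- rnorm(n); X4 <- rnorm(n)
b  <- c(b1 = 1, b2 = -2, b3 = 0.5, b4 = 3)
Y1 <- b[["b1"]] * X1 + b[["b2"]] * X2 + b[["b3"]] * X3
Y2 <- b[["b3"]] * X3 + b[["b4"]] * X4

# Stack the responses and build one design matrix with a column per
# parameter. X3 sits in the same column (b3) in both blocks, which is
# what makes that parameter shared.
y <- c(Y1, Y2)
Z4 <- rbind(cbind(b1 = X1, b2 = X2, b3 = X3, b4 = 0),
            cbind(b1 = 0,  b2 = 0,  b3 = X3, b4 = X4))
shared_fit <- lm.fit(Z4, y)$coefficients

# Five-parameter version: X3 gets its own column in each block.
Z5 <- rbind(cbind(b1 = X1, b2 = X2, b3 = X3, b4 = 0,  b5 = 0),
            cbind(b1 = 0,  b2 = 0,  b3 = 0,  b4 = X3, b5 = X4))
separate_fit <- lm.fit(Z5, y)$coefficients
```

The only difference between the two models is which column X3 lands in for the second block, which is the information any user-facing notation has to convey.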

Two possibilities:

  • Specify all the explanatory variables in one formula and supply a matrix mapping responses to explanatories. In which case

Foo( Y1+Y2 ~ X1 + X2 + X3 + X4, map=cbind(c(1,1,1,0),c(0,0,1,1)))

would correspond to the first case, and

Foo( Y1+Y2 ~ X1 + X2 + X3 + X4 + X5, map=cbind(c(1,1,1,0,0),c(0,0,0,1,1)))

would be the second. Obviously some parsing of the LHS would be needed, or it could be cbind(Y1,Y2). The advantage of this notation is that there is also other information that might be required for each parameter - starting values, priors etc - and the ordering is given by the ordering in the formula.

  • Have multiple formulae and a grouping function that just adds an attribute so shared parameters can be identified - the two examples then become:

Foo( Y1 ~ X1+X2+G(X3,1), Y2 ~ G(X3,1)+X4)

where the X3 parameter is shared between the formulae, and

Foo( Y1 ~ X1+X2+X3, Y2 ~ X3+X4)

which has independent parameters. The second parameter of G() is a grouping ID which gives the power to share model parameters flexibly.

A further explanation of the G function is shown by the following:

Foo( Y1 ~ X1+X2+G(X3,1), Y2~G(X3,1)+G(X4,2), Y3~G(X3,3)+G(X4,2), Y4~G(X3,3))

would be a model where:

Y1 = b1 X1 + b2 X2 + b3 X3
Y2 = b3 X3 + b4 X4
Y3 = b5 X3 + b4 X4
Y4 = b5 X3

where there are two independent parameters for X3 (G(X3,1) and G(X3,3)). How to handle a group that refers to a different explanatory variable is an open question - suppose that model had Y4~G(X3,2) - that seems to imply a shared parameter between different explanatory variables, since there's a G(X4,2) in there.

This notation seems easier for the user to comprehend, but if you also have to specify starting values then the mapping between a vector of starting values and the parameters they correspond to is no longer obvious. I suspect that internally we'd have to compute the mapping matrix from the G() notation.
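
As a rough sketch of how the mapping matrix might be computed from the G() notation: the function below walks the term labels of each formula without evaluating them and numbers the parameters in order of first appearance. Everything here is hypothetical, including the name param_map and the stand-in G(). It also assumes one particular answer to the open question above, namely that a group ID is scoped to its variable, so G(X3,2) and G(X4,2) would be different parameters. Intercepts are ignored.

```r
# Stand-in so G() is harmless if a formula is ever evaluated; this sketch
# only inspects the unevaluated calls. G() is a hypothetical marker.
G <- function(x, id) x

param_map <- function(...) {
  formulas <- list(...)
  keys <- character(0)  # one entry per distinct parameter
  uses <- list()        # parameter indices used by each response
  for (f in formulas) {
    resp <- deparse(f[[2]])
    idx  <- integer(0)
    for (lab in attr(terms(f), "term.labels")) {
      e <- parse(text = lab)[[1]]
      if (is.call(e) && identical(e[[1]], as.name("G"))) {
        # Grouped term: identity is the variable plus the group ID
        key <- paste0(deparse(e[[2]]), " [group ", deparse(e[[3]]), "]")
      } else {
        # Plain term: parameter belongs to this response alone
        key <- paste0(lab, " [", resp, " only]")
      }
      if (!key %in% keys) keys <- c(keys, key)
      idx <- c(idx, match(key, keys))
    }
    uses[[resp]] <- idx
  }
  # Rows are parameters, columns are responses
  map <- do.call(cbind, lapply(uses, function(i)
    as.integer(seq_along(keys) %in% i)))
  rownames(map) <- paste0("b", seq_along(keys))
  list(map = map, parameters = setNames(keys, rownames(map)))
}

res <- param_map(Y1 ~ X1 + X2 + G(X3, 1), Y2 ~ G(X3, 1) + G(X4, 2),
                 Y3 ~ G(X3, 3) + G(X4, 2), Y4 ~ G(X3, 3))
```

For the four-response example, res$parameters lists the five parameters in the order a vector of starting values would need to follow, which addresses the ordering concern at least in part.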

There may be better ways of doing this, so my question is... does anyone know one?

Spacedman asked Sep 25 '12 13:09


1 Answer

Interesting question (I wish all package authors worried a lot more in advance about how they were going to create extensions to the basic Wilkinson-Rogers formula notation ...)

How about something like

formula=list(Y1~X1+X2+X3,Y2~X3+X4,Y3~X3+X4,Y4~X3),
   shared=list(Y1+Y2~X3,Y2+Y3~X4,Y3+Y4~X3)

or something like that for your second example above?

The formula component gives the list of equations.

The shared component simply lists which response variables share the same parameter for specified predictor variables. It could obviously be mapped into a logical or binary table, but (for me at least -- this is certainly in the eye of the beholder) it's more straightforward. I think the map solution above is awkward when (as in this case) a variable (such as X3) is shared in two distinct sets of relationships.

I guess some straightforward rule like "starting values in the order in which the parameters appear in the list of formulas" -- in this case

X1, X2, X3(1), X4, X3(2)

would be OK, but it might be nice to provide a helper function that would tell the users the names of the coefficient vector (i.e. the order) given a formula/shared specification ...
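
A helper along those lines might look like the following sketch. The name coef_names is made up, the formula and shared arguments simply mirror the specification above, and it assumes a single response per formula and plain variable names as terms.

```r
coef_names <- function(formula, shared = list()) {
  vars <- character(0)  # predictor behind each parameter
  keys <- character(0)  # identity of each parameter
  for (f in formula) {
    resp <- all.vars(f[[2]])
    for (v in attr(terms(f), "term.labels")) {
      # First shared entry covering this response/predictor pair, if any
      hit <- which(vapply(shared, function(s)
        resp %in% all.vars(s[[2]]) && v %in% all.vars(s[[3]]), logical(1)))
      key <- if (length(hit)) paste("shared", hit[1]) else paste(resp, v)
      if (!key %in% keys) {
        keys <- c(keys, key)
        vars <- c(vars, v)
      }
    }
  }
  # Number a predictor's parameters only when it has more than one
  k <- ave(seq_along(vars), vars, FUN = seq_along)
  n <- ave(seq_along(vars), vars, FUN = length)
  ifelse(n > 1, paste0(vars, "(", k, ")"), vars)
}

nm <- coef_names(
  formula = list(Y1 ~ X1 + X2 + X3, Y2 ~ X3 + X4, Y3 ~ X3 + X4, Y4 ~ X3),
  shared  = list(Y1 + Y2 ~ X3, Y2 + Y3 ~ X4, Y3 + Y4 ~ X3)
)
```

Here nm follows the "order of first appearance" rule, so a user could check it before supplying starting values or priors.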

From a bit of personal experience, I would say that embedding more fanciness in the formula itself leads to pain ... for example, the original nlme syntax with the random effects specified separately was easier to deal with than the new lme4-style syntax with random effects and fixed effects mixed in the same formula ...

An alternative (which I don't like nearly as well) would be

 formula=list(Y1~X1+X2+X3,Y2~X3+X4,Y3~X3[2]+X4,Y4~X3[2])

where new parameters are indicated by some sort of tag (with [1] being implicit).

Also note the suggestion from @Andrie in the chat room that interfaces designed for structural equation modeling (the sem and lavaan packages) may be useful references.

Ben Bolker answered Sep 21 '22 10:09