In R, the formula object is symbolic and it seems rather hard to parse. However, I need to parse such a formula into an explicit set of labels for use outside of R.
(1)
Letting f represent a model formula in which no response is specified, e.g. ~V1 + V2 + V3, one thing I tried was:
t <- terms(f)
attr(t, "term.labels")
However, this doesn't give a fully explicit result if some of the variables in f are categorical. For example, let V1 be a categorical variable with 2 categories ("no" and "yes"), i.e. a boolean, and let V2 be a double.
Therefore, a model that is specified by ~V1:V2 should have 2 parameters: "intercept" and "V1yes:V2". Meanwhile, a model that is specified by ~V1:V2 - 1 should have parameters "V1no:V2" and "V1yes:V2". However, without a way of telling the function terms() which variables are categorical (and how many categories they have), it has no way of interpreting these. Instead, it just has V1:V2 in its "term.labels", which doesn't mean anything in the context that V1 is categorical.
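To illustrate the limitation, here is a minimal sketch of what terms() reports for this formula. It uses only base R; the name tt is just an illustrative choice (used here rather than t to avoid masking the transpose function).

```r
# terms() works purely symbolically: it knows nothing about column types.
f <- ~ V1:V2
tt <- terms(f)

term_labels   <- attr(tt, "term.labels")  # the symbolic label only: "V1:V2"
has_intercept <- attr(tt, "intercept")    # 1 if an intercept is present

# Removing the intercept changes the "intercept" attribute, not the labels.
tt_no_int <- terms(~ V1:V2 - 1)

print(term_labels)
print(has_intercept)
print(attr(tt_no_int, "intercept"))
```

The labels are the same whether V1 is a factor or a number, which is why they cannot be used directly as explicit parameter names.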
(2)
On the other hand, using model.matrix is an easy way to get exactly what I want. The problem is that it requires a data argument, which is bad for me because I only want an explicit interpretation of the symbolic formula for use outside of R. Getting it this way wastes a lot of time (comparatively), because R has to read the data from an outside source when all it really needs to know for the formula is which variables are categorical (and how many categories they have) and which variables are doubles.
Is there any way to use 'model.matrix' with only specifying the types of data, rather than the actual data? If not, what else is a viable solution?
Well, it is only in the context of having data that it can be determined whether a given variable is a factor or numeric. So you can't do it without the data argument. But all you need is the structure, not the data itself, so you can use a 0-row data frame with the columns of all the right types.
f <- ~ V1:V2
V1numeric <- data.frame(V1=numeric(0), V2=numeric(0))
V1factor <- data.frame(V1=factor(c(), levels=c("no","yes")), V2=numeric(0))
Looking at the two data.frames:
> V1numeric
[1] V1 V2
<0 rows> (or 0-length row.names)
> str(V1numeric)
'data.frame': 0 obs. of 2 variables:
$ V1: num
$ V2: num
> V1factor
[1] V1 V2
<0 rows> (or 0-length row.names)
> str(V1factor)
'data.frame': 0 obs. of 2 variables:
$ V1: Factor w/ 2 levels "no","yes":
$ V2: num
Use model.matrix with these:
> model.matrix(f, data=V1numeric)
(Intercept) V1:V2
attr(,"assign")
[1] 0 1
> model.matrix(f, data=V1factor)
(Intercept) V1no:V2 V1yes:V2
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$V1
[1] "contr.treatment"
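Since the goal is a set of explicit labels, the column names of the model matrix can be pulled out directly. The following is a minimal sketch, not a definitive implementation: make_skeleton() is a hypothetical helper (not part of base R) that builds a 0-row data frame from a type specification, under the assumed convention that a character vector gives factor levels and NULL means numeric.

```r
# Hypothetical helper: build a 0-row data frame from a type specification.
# Assumed convention: character vector = factor levels, NULL = numeric column.
make_skeleton <- function(spec) {
  cols <- lapply(spec, function(lv) {
    if (is.null(lv)) numeric(0) else factor(character(0), levels = lv)
  })
  as.data.frame(cols)
}

skeleton <- make_skeleton(list(V1 = c("no", "yes"), V2 = NULL))

# The column names of the (empty) model matrix are the explicit labels.
labels_int    <- colnames(model.matrix(~ V1:V2, data = skeleton))
labels_no_int <- colnames(model.matrix(~ V1:V2 - 1, data = skeleton))

print(labels_int)     # "(Intercept)" "V1no:V2" "V1yes:V2"
print(labels_no_int)  # "V1no:V2" "V1yes:V2"
```

Only the type information is passed in, so no real data has to be read. Note that each factor needs at least two levels, otherwise model.matrix cannot set up contrasts for it.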
If you have a real data set, it is easy to get a 0-row data.frame from it that retains the column information. Just subscript it with FALSE. You don't need to build the data.frame by hand if you have one with the right properties.
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
> str(mtcars[FALSE,])
'data.frame': 0 obs. of 11 variables:
$ mpg : num
$ cyl : num
$ disp: num
$ hp : num
$ drat: num
$ wt : num
$ qsec: num
$ vs : num
$ am : num
$ gear: num
$ carb: num
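Every column of mtcars is numeric, so a categorical variable has to be converted to a factor before (or after) taking the 0-row subset. As a rough sketch, assuming am should be treated as a two-level factor (0 = automatic, 1 = manual, per the mtcars documentation) and using the illustrative name skeleton:

```r
# Take a 0-row subset, then declare which columns are categorical.
skeleton <- mtcars[FALSE, c("am", "wt")]
skeleton$am <- factor(skeleton$am, levels = c(0, 1),
                      labels = c("automatic", "manual"))

# Levels are taken from the levels argument, not from the (absent) rows.
mm_labels <- colnames(model.matrix(~ am:wt, data = skeleton))
print(mm_labels)  # "(Intercept)" "amautomatic:wt" "ammanual:wt"
```

The levels must be given explicitly here, because a 0-row column has no values from which factor() could infer them.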