
Using R to interpret a symbolic formula for outside use

Tags: parsing, r, formula

In R, the formula object is symbolic and it seems rather hard to parse. However, I need to parse such a formula into an explicit set of labels for use outside of R.

(1)

Let f be a model formula with no response specified, e.g. ~V1 + V2 + V3. One thing I tried was:

t <- terms(f)
attr(t, "term.labels")

However, this doesn't give a fully explicit set of labels if some of the variables in f are categorical. For example, let V1 be a categorical variable with 2 categories (i.e. a boolean, with levels "no" and "yes"), and let V2 be a double.

I would then expect a model specified by ~V1:V2 to have 2 parameters: "(Intercept)" and "V1yes:V2". Meanwhile, a model specified by ~V1:V2 - 1 should have the parameters "V1no:V2" and "V1yes:V2". However, without a way of telling terms() which variables are categorical (and how many categories each has), it has no way to interpret these. Instead, it just has V1:V2 in its "term.labels", which means nothing on its own when V1 is categorical.
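To make the limitation concrete, here is a minimal sketch (using the same V1 and V2 names as above). It shows that terms() works purely on the symbols: the term labels are the same with or without an intercept, and nothing in the result reflects whether V1 is a factor.

```r
# terms() never sees the variable types, only the symbols in the formula.
f_int   <- ~ V1:V2        # with intercept
f_noint <- ~ V1:V2 - 1    # intercept removed

labels_int   <- attr(terms(f_int), "term.labels")
labels_noint <- attr(terms(f_noint), "term.labels")

print(labels_int)
print(labels_noint)

# The intercept is tracked separately, as 1 (present) or 0 (absent).
has_int   <- attr(terms(f_int), "intercept")
has_noint <- attr(terms(f_noint), "intercept")
print(c(has_int, has_noint))
```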

(2)

On the other hand, using model.matrix is an easy way to get exactly what I want. The problem is that it requires a data argument, which is bad for me because I only want an explicit interpretation of the symbolic formula for use outside of R. Getting it this way wastes a lot of time (comparatively), because R has to read the data from an outside source when all it really needs to know for the formula is which variables are categorical (and how many categories each has) and which variables are doubles.

Is there any way to use 'model.matrix' with only specifying the types of data, rather than the actual data? If not, what else is a viable solution?

Jon Claus asked May 16 '13 at 16:05


1 Answer

Well, it is only in the context of having data that it can be determined whether a given variable is a factor or numeric. So you can't do it without the data argument. But all you need is the structure, not the data itself, so you can use a 0-row data frame with the columns of all the right types.

f <- ~ V1:V2
V1numeric <- data.frame(V1=numeric(0), V2=numeric(0))
V1factor <- data.frame(V1=factor(c(), levels=c("no","yes")), V2=numeric(0))

Looking at the two data.frames:

> V1numeric
[1] V1 V2
<0 rows> (or 0-length row.names)
> str(V1numeric)
'data.frame':   0 obs. of  2 variables:
 $ V1: num 
 $ V2: num 
> V1factor
[1] V1 V2
<0 rows> (or 0-length row.names)
> str(V1factor)
'data.frame':   0 obs. of  2 variables:
 $ V1: Factor w/ 2 levels "no","yes": 
 $ V2: num 

Use model.matrix with these:

> model.matrix(f, data=V1numeric)
     (Intercept) V1:V2
attr(,"assign")
[1] 0 1
> model.matrix(f, data=V1factor)
     (Intercept) V1no:V2 V1yes:V2
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$V1
[1] "contr.treatment"
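If only the labels are needed outside of R, the column names of the model matrix are enough. The following is a rough sketch rather than a definitive implementation: empty_frame() is a hypothetical helper (it is not part of base R) that builds the 0-row data frame from variable names and factor levels alone, so no data ever has to be read.

```r
# Hypothetical helper (an assumption for illustration, not a base R function):
# build a 0-row data frame from type information only.
#   numeric_vars:  character vector of numeric variable names
#   factor_levels: named list mapping factor names to their levels
empty_frame <- function(numeric_vars = character(0), factor_levels = list()) {
  num_cols <- lapply(setNames(nm = numeric_vars), function(v) numeric(0))
  fac_cols <- lapply(factor_levels,
                     function(lv) factor(character(0), levels = lv))
  as.data.frame(c(fac_cols, num_cols))
}

types_only <- empty_frame(numeric_vars = "V2",
                          factor_levels = list(V1 = c("no", "yes")))

# The column names of the model matrix are the explicit parameter labels.
labels_int   <- colnames(model.matrix(~ V1:V2, data = types_only))
labels_noint <- colnames(model.matrix(~ V1:V2 - 1, data = types_only))

# One label per line, ready to be consumed by another program.
writeLines(labels_int)
writeLines(labels_noint)
```

Keeping the numeric names and the factor levels as separate arguments means the description of the variables can come from a small config file instead of the data itself.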

If you have a real data set, it is easy to get a 0-row data.frame from that which retains the column information. Just subscript it with FALSE. You don't need to build the data.frame by hand if you have one with the right properties.

> str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
> str(mtcars[FALSE,])
'data.frame':   0 obs. of  11 variables:
 $ mpg : num 
 $ cyl : num 
 $ disp: num 
 $ hp  : num 
 $ drat: num 
 $ wt  : num 
 $ qsec: num 
 $ vs  : num 
 $ am  : num 
 $ gear: num 
 $ carb: num 
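Note that every column of mtcars is numeric, so a categorical variable has to be converted to a factor before the rows are dropped; a factor created from an empty column afterwards would have no levels. A minimal sketch, assuming am should be treated as a factor (the labels "automatic" and "manual" and the names cars and cars_empty are chosen here for illustration):

```r
# Convert the categorical column to a factor BEFORE dropping the rows,
# so that its levels survive in the 0-row copy.
cars <- transform(mtcars,
                  am = factor(am, levels = c(0, 1),
                              labels = c("automatic", "manual")))
cars_empty <- cars[FALSE, ]

# The 0-row frame keeps the column types and the factor levels.
am_levels <- levels(cars_empty$am)
print(am_levels)

# Explicit parameter labels for an interaction of the factor with a numeric.
am_wt_labels <- colnames(model.matrix(~ am:wt, data = cars_empty))
print(am_wt_labels)
```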
Brian Diggs answered Sep 20 '22 at 20:09