Defining an infix operator for use within a formula

Tags: r, formula, infix-operator

I'm trying to create a more parsimonious version of this solution, which entails specifying the RHS of a formula in the form d1 + d1:d2.

Given that * in the context of a formula is a pithy stand-in for full interaction (i.e. d1 * d2 gives d1 + d2 + d1:d2), my approach has been to try and define an alternative operator, say %+:% using the infix approach I've grown accustomed to in other applications, a la:

"%+:%" <- function(d1,d2) d1 + d2 + d1:d2

However, this predictably fails because I haven't been careful about evaluation; let's introduce an example to illustrate my progress:

set.seed(1029)
v1 <- runif(1000)
v2 <- runif(1000)
y <- .8*(v1 < .3) + .2 * (v2 > .25 & v2 < .8) - 
  .4 * (v2 > .8) + .1 * (v1 > .3 & v2 > .8)

With this example, hopefully it's clear why simply writing out the two terms might be undesirable:

y ~ cut(v2, breaks = c(0, .25, .8, 1)) +
  cut(v2, breaks = c(0, .25, .8, 1)):I(v1 < .3)

One workaround which is close to my desired output is to define the whole formula as a function:

plus.times <- function(outvar, d1, d2){
  as.formula(paste0(quote(outvar), "~", quote(d1),
                    "+", quote(d1), ":", quote(d2)))
}

This gives the expected coefficients when passed to lm, but with names that are harder to interpret directly (especially in the real data where we take care to give d1 and d2 descriptive names, in contrast to this generic example):

out1 <- lm(y ~ cut(v2, breaks = c(0, .25, .8, 1)) +
             cut(v2, breaks = c(0, .25, .8, 1)):I(v1 < .3))
out2 <- lm(plus.times(y, cut(v2, breaks = c(0, .25, .8, 1)), I(v1 < .3)))
any(out1$coefficients != out2$coefficients)
# [1] FALSE
names(out2$coefficients)
# [1] "(Intercept)"         "d1(0.25,0.8]"        "d1(0.8,1]"           "d1(0,0.25]:d2TRUE"  
# [5] "d1(0.25,0.8]:d2TRUE" "d1(0.8,1]:d2TRUE"

So this is less than optimal.

Is there any way to adjust the code so that the infix operator I mentioned above works as expected? How about altering the form of plus.times so that the variables are not renamed?

I've been poking around (?formula, ?"~", ?":", getAnywhere(formula.default), this answer, etc.) but haven't seen how exactly R interprets * when it's encountered in a formula so that I can make my desired minor adjustments.


asked Sep 16 '15 19:09

MichaelChirico

1 Answer

You do not need to define a new operator in this case: in a formula, d1/d2 expands to d1 + d1:d2. In other words, d1/d2 specifies that d2 is nested within d1. Continuing your example:

out3 <- lm(y ~ cut(v2,breaks=c(0,.25,.8,1))/I(v1 < .3))
all.equal(coef(out1), coef(out3))
# [1] TRUE
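If a helper in the spirit of plus.times is still wanted, one possible sketch is below. It assumes that capturing the unevaluated arguments with substitute() and splicing them into a formula with bquote() is acceptable; the name nested_formula is hypothetical and not part of any package. Because the original expressions, rather than the placeholder names d1 and d2, end up in the formula, the coefficient names should match those from writing the formula out by hand.

```r
set.seed(1029)
v1 <- runif(1000)
v2 <- runif(1000)
y <- .8 * (v1 < .3) + .2 * (v2 > .25 & v2 < .8) -
  .4 * (v2 > .8) + .1 * (v1 > .3 & v2 > .8)

# Hypothetical helper: builds outvar ~ d1 + d1:d2 from the *expressions*
# supplied by the caller, so the terms keep their original labels.
nested_formula <- function(outvar, d1, d2) {
  lhs <- substitute(outvar)  # unevaluated response expression
  a <- substitute(d1)        # unevaluated first term
  b <- substitute(d2)        # unevaluated second term
  f <- eval(bquote(.(lhs) ~ .(a) + .(a):.(b)))
  # Look variables up where the helper was called, not inside the helper
  environment(f) <- parent.frame()
  f
}

fit_direct <- lm(y ~ cut(v2, breaks = c(0, .25, .8, 1)) +
                   cut(v2, breaks = c(0, .25, .8, 1)):I(v1 < .3))
fit_helper <- lm(nested_formula(y, cut(v2, breaks = c(0, .25, .8, 1)),
                                I(v1 < .3)))

# Compare coefficient names and values between the two fits
identical(names(coef(fit_helper)), names(coef(fit_direct)))
all.equal(coef(fit_helper), coef(fit_direct))
```

In practice the built-in / operator is shorter and needs no helper at all, so this is mainly of interest when the formula has to be assembled programmatically.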

Further comments

Factors may be crossed or nested. Two factors are crossed if it is possible to observe every combination of levels of the two factors, e.g. sex and treatment, temperature and pH, etc. A factor is nested within another if each level of that factor can only be observed within one of the levels of the other factor, e.g. town and country, staff member and store, etc.

These relationships are reflected in the parametrization of the model. For crossed factors we use d1*d2 or d1 + d2 + d1:d2, to give the main effect of each factor, plus the interaction. For nested factors we use d1/d2 or d1 + d1:d2 to give a separate submodel of the form 1 + d2 for each level of d1.
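One way to see how R expands these operators is to inspect the term labels that terms() produces. The snippet below is a small illustration; d1, d2 and y are placeholder names and do not need to exist, since terms() only parses the formula.

```r
# Expansion of the crossing operator: main effects plus interaction
crossed_terms <- attr(terms(y ~ d1 * d2), "term.labels")
crossed_terms
# [1] "d1"    "d2"    "d1:d2"

# Expansion of the nesting operator: d1 plus d2 within d1
nested_terms <- attr(terms(y ~ d1 / d2), "term.labels")
nested_terms
# [1] "d1"    "d1:d2"
```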

The idea of nesting is not restricted to factors; for example, we may use sex/x to fit a separate linear regression on x for males and females.
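As a rough illustration of the sex/x case, the sketch below uses simulated data (the variable names and the true coefficients are made up for the example) and compares the per-sex slopes from the nested model with the slopes from two separate regressions.

```r
set.seed(42)
n <- 200
sex <- factor(sample(c("F", "M"), n, replace = TRUE))
x <- rnorm(n)
# Simulated response with a different intercept and slope for each sex
y <- ifelse(sex == "F", 1 + 2 * x, -1 + 0.5 * x) + rnorm(n, sd = 0.3)

# Nested model: expands to sex + sex:x, i.e. one slope on x per level of sex
nested <- lm(y ~ sex/x)

# Separate regressions fitted to each subset
fit_f <- lm(y ~ x, subset = sex == "F")
fit_m <- lm(y ~ x, subset = sex == "M")

slopes_nested <- unname(coef(nested)[c("sexF:x", "sexM:x")])
slopes_separate <- unname(c(coef(fit_f)["x"], coef(fit_m)["x"]))

# The two sets of slopes should agree up to numerical tolerance
all.equal(slopes_nested, slopes_separate)
```

The standard errors will generally differ between the two approaches, because the nested model pools the residual variance across both groups.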

In a formula, %in% is equivalent to :, but it may be used to emphasize the nested, or hierarchical, structure of the data/model. For example, a + b %in% a is the same as a + a:b, but reading it as "a plus b within a" gives a better description of the model being fitted. Even so, using / has the advantage of simplifying the model formula at the same time as emphasizing the structure.
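The three spellings can be compared on a small balanced data set. This is only a sketch with simulated factors a and b (names and data are invented for the example); it compares fitted values rather than coefficient names, since the term labels R reports for %in% may differ from those for a:b.

```r
set.seed(1)
# Balanced design: every combination of a and b is observed 10 times
dat <- expand.grid(a = factor(c("a1", "a2")),
                   b = factor(c("b1", "b2", "b3")),
                   rep = 1:10)
dat$y <- rnorm(nrow(dat))

fit_slash <- lm(y ~ a/b, data = dat)           # a + a:b via the / shorthand
fit_colon <- lm(y ~ a + a:b, data = dat)       # written out in full
fit_in    <- lm(y ~ a + b %in% a, data = dat)  # "a plus b within a"

# All three should describe the same fitted model
all.equal(fitted(fit_slash), fitted(fit_colon))
all.equal(fitted(fit_colon), fitted(fit_in))
```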


answered Oct 26 '22 10:10

Heather Turner
