I haven't been able to find an answer to this question, largely because googling anything with a standalone letter (like "I") causes issues.
What does the "I" do in a model like this?
data(rock)
lm(area ~ I(peri - mean(peri)), data = rock)
Considering that the following does NOT work:
lm(area ~ (peri - mean(peri)), data = rock)
and that this does work:
rock$peri - mean(rock$peri)
Any key words on how to research this myself would also be very helpful.
I() isolates or insulates the contents of I( ... ) from the gaze of R's formula parsing code. It allows the standard R operators to work as they would if you used them outside of a formula, rather than being treated as special formula operators.
For example:
y ~ x + x^2
would, to R, mean "give me: x = the main effect of x, and x^2 = the main effect and the second order interaction of x", not the intended x plus x-squared:
> model.frame( y ~ x + x^2, data = data.frame(x = rnorm(5), y = rnorm(5)))
           y           x
1 -1.4355144 -1.85374045
2  0.3620872 -0.07794607
3 -1.7590868  0.96856634
4 -0.3245440  0.18492596
5 -0.6515630 -1.37994358
This is because ^ is a special operator in a formula, as described in ?formula. You end up only including x in the model frame because the main effect of x is already included from the x term in the formula, and there is nothing to cross x with to get the second-order interactions in the x^2 term.
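You can check how the formula code reads these terms without fitting anything. The following is a small sketch using terms() from base R; the variable names y, x, a and b are placeholders only and do not need to exist, because terms() only parses the formula here.

```r
# Term labels show which terms the formula parser actually keeps.
f_plain <- y ~ x + x^2      # ^ is read as the formula crossing operator
f_asis  <- y ~ x + I(x^2)   # ^ is read as ordinary arithmetic

attr(terms(f_plain), "term.labels")  # "x"
attr(terms(f_asis), "term.labels")   # "x" "I(x^2)"

# With two variables, ^2 means "all terms up to second-order interactions"
attr(terms(y ~ (a + b)^2), "term.labels")  # "a" "b" "a:b"
```

The last line shows what ^ is meant for in a formula: crossing several variables, which is why x^2 on its own collapses to just x.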
To get the usual operator, you need to use I() to isolate the call from the formula code:
> model.frame( y ~ x + I(x^2), data = data.frame(x = rnorm(5), y = rnorm(5)))
            y          x       I(x^2)
1 -0.02881534  1.0865514 1.180593....
2  0.23252515 -0.7625449 0.581474....
3 -0.30120868 -0.8286625 0.686681....
4 -0.67761458  0.8344739 0.696346....
5  0.65522764 -0.9676520 0.936350....
(That last column is correct; it just looks odd because it is of class AsIs.)
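If you want to convince yourself that the AsIs column holds ordinary numbers, a quick check along these lines should do it. The data values here are made up for illustration (fixed rather than random, so the result is reproducible).

```r
# I() only tags an object with class "AsIs"; the values are unchanged.
z <- I(c(1.5, 2.5))
class(z)  # "AsIs"

# Illustrative data, not from the question
d  <- data.frame(x = c(1, 2, 3), y = c(2, 4, 9))
mf <- model.frame(y ~ x + I(x^2), data = d)

class(mf[["I(x^2)"]])       # "AsIs"
as.numeric(mf[["I(x^2)"]])  # 1 4 9, i.e. x squared
```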
In your example, - when used in a formula would indicate removal of a term from the model, whereas you wanted - to have its usual binary operator meaning of subtraction:
> model.frame( y ~ x - mean(x), data = data.frame(x = rnorm(5), y = rnorm(5)))
Error in model.frame.default(y ~ x - mean(x), data = data.frame(x = rnorm(5), :
  variable lengths differ (found for 'mean(x)')
This fails because mean(x) is a length 1 vector and model.frame() quite rightly tells you this doesn't match the length of the other variables. A way round this is I():
> model.frame( y ~ I(x - mean(x)), data = data.frame(x = rnorm(5), y = rnorm(5)))
           y I(x - mean(x))
1  1.1727063   1.142200....
2 -1.4798270   -0.66914....
3 -0.4303878   -0.28716....
4 -1.0516386   0.542774....
5  1.5225863   -0.72865....
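The term-removal meaning of - can be seen directly by inspecting the parsed terms. This is only a sketch; y, x and z are placeholder names, and terms() does not need them to exist.

```r
# In a formula, - removes a term rather than subtracting values.
attr(terms(y ~ x + z - z), "term.labels")  # "x"  (z was added, then removed)

# The familiar "- 1" idiom removes the intercept in the same way.
attr(terms(y ~ x - 1), "intercept")        # 0

# Inside I(), - is plain subtraction and the whole call is one term.
attr(terms(y ~ I(x - 1)), "term.labels")   # "I(x - 1)"
```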
Hence, where you want to use an operator that has special meaning in a formula, but you need its non-formula meaning, you need to wrap the elements of the operation in I( ).
Read ?formula for more on the special operators, and ?I for more details on the function itself and its other main use-case within data frames (which is where the AsIs bit originates from, if you are interested).
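Applied to the rock data from the question, a sketch like the following shows what the I() version is doing. The object names (fit_centred, fit_raw, rock2, peri_c) are my own, chosen for illustration. Centring a predictor leaves the slope unchanged and makes the intercept equal to the mean of the response, which gives a simple sanity check.

```r
data(rock)

# Centred predictor computed inside the formula via I()
fit_centred <- lm(area ~ I(peri - mean(peri)), data = rock)
# Uncentred fit for comparison
fit_raw <- lm(area ~ peri, data = rock)

# Same slope; the intercept of the centred fit is mean(rock$area)
all.equal(coef(fit_centred)[[2]], coef(fit_raw)[["peri"]])
all.equal(coef(fit_centred)[[1]], mean(rock$area))

# Alternative without I(): compute the centred column first
rock2 <- rock
rock2$peri_c <- rock2$peri - mean(rock2$peri)
fit_pre <- lm(area ~ peri_c, data = rock2)
all.equal(unname(coef(fit_pre)), unname(coef(fit_centred)))
```

Precomputing the column gives the same fit; I() just saves you from adding a helper variable to the data frame.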