lm sets model = TRUE by default, meaning the entire dataset used for fitting is copied and returned with the fitted object. This copy is used by predict, but it creates memory overhead (example below).
I am wondering: is the copied dataset used for any reason other than predict?
Not essential to answer, but I'd also like to know of models that store data for reasons other than predict.
object.size(lm(mpg ~ ., mtcars))
#> 45768 bytes
object.size(lm(mpg ~ ., mtcars, model = FALSE))
#> 28152 bytes
Bigger dataset = bigger overhead.
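To see roughly how the gap scales, here is a quick sketch with a made-up larger data frame (big is my own example, not from the original question):

# hypothetical larger dataset, for illustration only
set.seed(1)
big <- data.frame(y = rnorm(1e5), x1 = runif(1e5), x2 = runif(1e5))
object.size(lm(y ~ ., big))                 # carries a copy of the model frame
object.size(lm(y ~ ., big, model = FALSE))  # noticeably smaller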
To share my motivation: the twidlr package forces users to provide data when using predict. If this makes copying the dataset at fitting time unnecessary, it seems reasonable to save memory by defaulting to model = FALSE. I've opened a relevant issue here.
A secondary motivation: you can easily fit many models like lm with pipelearner, but copying the data each time creates massive overhead. So finding ways to cut down memory needs would be very handy!
I think the model frame is returned as a protection against non-standard evaluation.
Let's look at a small example.
dat <- data.frame(x = runif(10), y = rnorm(10))
FIT <- lm(y ~ x, data = dat)    # model frame kept (model = TRUE by default)
fit <- FIT; fit$model <- NULL   # same fit, but with the model frame dropped
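The only structural difference between the two objects is the stored frame. A quick check of my own:

"model" %in% names(FIT)   # TRUE: the frame travels with the object
"model" %in% names(fit)   # FALSE: it must be rebuilt on demand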
What is the difference between
model.frame(FIT)
model.frame(fit)
? Checking methods(model.frame) and stats:::model.frame.lm shows that in the first case the model frame is cheaply extracted from FIT$model, while in the second case it is reconstructed from fit$call by model.frame.default. The same difference then shows up between
# depends on `model.frame`
model.matrix(FIT)
model.matrix(fit)
as the model matrix is built from the model frame. If we dig further, we will see that the same split runs through these, too:
# depends on `model.matrix`
predict(FIT)
predict(fit)
# depends on `predict.lm`
plot(FIT)
plot(fit)
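To be clear, both objects yield the same results as long as dat is still in the workspace; only the route taken differs. A sanity check of my own, not from the original answer:

# expected: TRUE while `dat` still exists
all.equal(predict(FIT), predict(fit))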
Note that this is where the problem arises. If we deliberately remove dat, the model frame can not be reconstructed, and all of these will fail:
rm(dat)
model.frame(fit)
model.matrix(fit)
predict(fit)
plot(fit)
while the same calls on FIT still work.
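To spell the contrast out (my own illustration, not part of the original answer):

# FIT carries its own copy of the model frame, so even with `dat` gone:
model.frame(FIT)    # still available, straight from FIT$model
model.matrix(FIT)
predict(FIT)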
And it gets worse. The following example under non-standard evaluation is really bad!
fitting <- function (myformula, mydata, keep.mf = FALSE) {
  b <- lm(formula = myformula, data = mydata, model = keep.mf)
  par(mfrow = c(2, 2))
  plot(b)
  predict(b)
}
Now let's create a data frame again (we removed it earlier):
dat <- data.frame(x = runif(10), y = rnorm(10))
Can you see that
fitting(y ~ x, dat, keep.mf = TRUE)
works but
fitting(y ~ x, dat, keep.mf = FALSE)
fails?
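The reason is where the call gets re-evaluated: stats:::model.frame.lm rebuilds the frame by evaluating b$call in the environment of the formula, which here is the global environment, where neither myformula nor mydata exists. A small diagnostic of my own (inspect is a hypothetical helper, not part of any package):

inspect <- function (myformula, mydata) {
  b <- lm(formula = myformula, data = mydata, model = FALSE)
  print(b$call)                    # lm(formula = myformula, data = mydata, model = FALSE)
  print(environment(formula(b)))   # the global environment, where re-evaluation happens
  invisible(b)
}
tmp <- inspect(y ~ x, dat)
model.frame(tmp)   # fails: `mydata` is not visible in the global environment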
Here is a question I answered / investigated a year ago: R - model.frame() and non-standard evaluation. It was asked about the survival package. That example is really extreme: even if we provide newdata, we would still get an error. Retaining the model frame is the only way to proceed!
Finally, on your observation of memory costs: in fact, $model is not the main culprit behind a potentially large lm object; $qr is, as it has the same dimensions as the model matrix. For a model with lots of factors, or nonlinear terms like bs, ns or poly, the model frame is much smaller than the model matrix. So dropping the model frame does not help much in reducing lm object size. This is actually one motivation behind the development of biglm.
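One way to see which components dominate, using an illustrative fit of my own (not from the original answer):

# compare per-component sizes of a fit with factor and polynomial terms
f <- lm(mpg ~ poly(disp, 5) + factor(gear), mtcars)
sizes <- sapply(f, function(z) as.numeric(object.size(z)))
sort(sizes, decreasing = TRUE)   # `qr` should rank among the largest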
Since I have inevitably mentioned biglm, I would emphasize again that it only helps reduce the final model object size, not the RAM usage during model fitting.