Does `lm` return `model` for reasons other than `predict`

Tags: memory, r, lm

lm sets model = TRUE by default, meaning the entire dataset used for learning is copied and returned with the fitted object. This is used by predict but creates memory overhead (example below).

I am wondering, is the copied dataset used for any reason other than predict?

Not essential to answer, but I'd also like to know of models that store data for reasons other than predict.

Example

object.size(lm(mpg ~ ., mtcars))
#> 45768 bytes
object.size(lm(mpg ~ ., mtcars, model = FALSE))
#> 28152 bytes

Bigger dataset = bigger overhead.

Motivation

To share my motivation, the twidlr package forces users to provide data when using predict. If this makes copying the dataset when learning unnecessary, it seems reasonable to save memory by defaulting to model = FALSE. I've opened a relevant issue here.

A secondary motivation - you can easily fit many models like lm with pipelearner, but copying data each time creates massive overhead. So finding ways to cut down memory needs would be very handy!

Simon Jackson asked Jun 23 '17 22:06


1 Answer

I think the model frame is returned as a safeguard against problems caused by non-standard evaluation.

Let's look at a small example.

dat <- data.frame(x = runif(10), y = rnorm(10))
FIT <- lm(y ~ x, data = dat)      # model frame is kept in FIT$model
fit <- FIT; fit$model <- NULL     # same fit with the model frame stripped

What is the difference between the following two calls?

model.frame(FIT)
model.frame(fit)

Checking methods(model.frame) and stats:::model.frame.lm shows that in the first case the model frame is simply extracted from FIT$model, while in the second case it is reconstructed by re-evaluating fit$call through model.frame.default. The same difference carries over to

# depends on `model.frame`
model.matrix(FIT)
model.matrix(fit)

because the model matrix is built from the model frame. Digging further, the following calls also take different code paths, although they return the same results as long as dat is still available:

# depends on `model.matrix`
predict(FIT)
predict(fit)

# depends on `predict.lm`
plot(FIT)
plot(fit)
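
As a rough check of the point above, the sketch below compares the full and the stripped fit while dat still exists. The set.seed call and the names same_mm and same_pred are my additions for illustration; the comparison uses all.equal from base R.

```r
# Sketch: while `dat` is still around, both objects should give the same results,
# even though the stripped fit has to rebuild its model frame from the call.
set.seed(1)  # assumption: seed added only so the sketch is reproducible
dat <- data.frame(x = runif(10), y = rnorm(10))
FIT <- lm(y ~ x, data = dat)
fit <- FIT; fit$model <- NULL

same_mm   <- isTRUE(all.equal(model.matrix(FIT), model.matrix(fit)))
same_pred <- isTRUE(all.equal(predict(FIT), predict(fit)))
c(model.matrix = same_mm, predict = same_pred)
```

So the difference at this stage lies in how the result is obtained, not in the numbers.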

This is where the problem lies. If we deliberately remove dat, the model frame can no longer be reconstructed, and all of the following fail:

rm(dat)
model.frame(fit)
model.matrix(fit)
predict(fit)
plot(fit)

whereas the same calls on FIT still work.
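
A minimal sketch of that failure, wrapped in tryCatch so it can be run without stopping at the error. The helper runs_ok is hypothetical (not part of any package), and the data are recreated so the block is self-contained.

```r
set.seed(1)  # assumption: seed added only for reproducibility
dat <- data.frame(x = runif(10), y = rnorm(10))
FIT <- lm(y ~ x, data = dat)
fit <- FIT; fit$model <- NULL
rm(dat)      # the original data are gone from now on

# Hypothetical helper: TRUE if the expression evaluates without an error.
runs_ok <- function(expr) tryCatch({ expr; TRUE }, error = function(e) FALSE)

full_ok     <- runs_ok(predict(FIT))  # uses the stored FIT$model
stripped_ok <- runs_ok(predict(fit))  # must re-evaluate the call, which needs `dat`
c(full = full_ok, stripped = stripped_ok)
```

The stripped fit should be the only one that errors, since it is the only one that has to look up dat again.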


And that is not the worst case. The following example, which involves non-standard evaluation inside a function, is far worse.

fitting <- function (myformula, mydata, keep.mf = FALSE) {
  b <- lm(formula = myformula, data = mydata, model = keep.mf)
  par(mfrow = c(2, 2))
  plot(b)
  predict(b)
}

Now let's create the data frame again (we removed it earlier):

dat <- data.frame(x = runif(10), y = rnorm(10))

Can you see that

fitting(y ~ x, dat, keep.mf = TRUE)

works but

fitting(y ~ x, dat, keep.mf = FALSE)

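fails? Without a stored model frame, the fit has to re-evaluate its call, and that call refers to mydata, a name local to fitting that cannot be found from the environment of the formula, where the call is re-evaluated.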
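
To see this without opening a graphics device, here is a sketch using a hypothetical variant of fitting, called fit_then_predict, that skips the plots and captures any error message instead of stopping. It also tries predict with newdata, which for a plain lm fit should not need the stored model frame. All names in the block are illustrative.

```r
# Hypothetical variant of `fitting`: no plotting, errors returned as messages.
fit_then_predict <- function(myformula, mydata, keep.mf = FALSE) {
  b <- lm(formula = myformula, data = mydata, model = keep.mf)
  list(
    in_sample    = tryCatch(predict(b),
                            error = function(e) conditionMessage(e)),
    with_newdata = tryCatch(predict(b, newdata = mydata),
                            error = function(e) conditionMessage(e))
  )
}

set.seed(1)  # assumption: seed added only for reproducibility
d <- data.frame(x = runif(10), y = rnorm(10))

res_keep <- fit_then_predict(y ~ x, d, keep.mf = TRUE)
res_drop <- fit_then_predict(y ~ x, d, keep.mf = FALSE)

is.numeric(res_keep$in_sample)     # stored model frame is used
is.character(res_drop$in_sample)   # an error message was captured instead
is.numeric(res_drop$with_newdata)  # explicit newdata avoids the lookup of `mydata`
```

If this behaves as expected, it suggests that for lm the stored frame matters mainly for calls made without newdata.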

Here is a question I answered and investigated a year ago: "R - model.frame() and non-standard evaluation". It was asked about the survival package. That example is really extreme: even if we provide newdata, we still get an error. Retaining the model frame is the only way to proceed!


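Finally, on your observation about memory cost: $model is not the main reason an lm object can be large. $qr is, because it has the same dimensions as the model matrix. Consider a model with many factor levels or interactions: the model frame stores each factor as a single column of integer codes, while the model matrix expands it into several dummy columns, so the model frame is much smaller than the model matrix. (Terms such as bs, ns or poly are stored in the model frame already expanded into their basis columns, so for those the two are of comparable size.) Omitting the model frame therefore reduces the size of the lm object less than one might expect. This is in fact one motivation for the development of biglm.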
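
A quick way to check this on a given fit is to measure the components separately. The sketch below uses made-up data with a 26-level factor; the names d, fit_g and comp_sizes are illustrative only.

```r
set.seed(1)  # assumption: seed added only for reproducibility
n <- 1000
d <- data.frame(
  y = rnorm(n),
  g = factor(sample(letters, n, replace = TRUE))  # factor with up to 26 levels
)
fit_g <- lm(y ~ g, data = d)

# Size in bytes of the stored model frame versus the QR decomposition.
comp_sizes <- sapply(fit_g[c("model", "qr")],
                     function(z) as.numeric(object.size(z)))
comp_sizes
```

Here $qr holds an n-by-p matrix of doubles with one column per coefficient, whereas $model keeps g as a single column of integer codes, so $qr should come out considerably larger.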


Since I have inevitably mentioned biglm, I would emphasize again that this method only helps reduce the size of the final model object, not RAM usage during model fitting.

Zheyuan Li answered Oct 24 '22 10:10