Why is caret train taking up so much memory?


When I fit the model just using glm(), everything works, and I don't even come close to exhausting memory. But when I run train(..., method='glm'), I run out of memory.

Is this because train is storing a lot of data for each iteration of the cross-validation (or whatever the trControl procedure is)? I'm looking at trainControl and I can't find how to prevent this...any hints? I only care about the performance summary and maybe the predicted responses.

(I know it's not related to storing data from each iteration of the parameter-tuning grid search, because there's no tuning grid for glm, I believe.)

Asked Jul 01 '11 by Yang

1 Answer

The problem is twofold: i) train() doesn't just fit a model via glm(), it bootstraps that model, so even with the defaults train() will do 25 bootstrap samples, which, coupled with problem ii), is the (or a) source of your problem; and ii) train() simply calls glm() with its defaults, and those defaults are to store the model frame (argument model = TRUE of ?glm), which includes a copy of the data in model-frame form. The object returned by train() already stores a copy of the data in $trainingData, and the "glm" object in $finalModel also has a copy of the actual data.
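
To see both points for yourself, something along these lines should work (trainControl() and formals() are standard caret/R calls, though defaults may differ between versions, so treat this as a sketch rather than gospel):

## (i) train()'s default resampling is 25 bootstrap reps; the scheme is
##     set via trainControl(), e.g. 5-fold CV, passed as trControl = ctrl:
library(caret)
ctrl <- trainControl(method = "cv", number = 5)

## (ii) glm() keeps the model frame by default:
formals(glm)$model   ## TRUE unless you override it with model = FALSE

Fewer resamples means fewer model fits (and fewer copies of the model frame over the run), but it also changes what train() is estimating, so it is a trade-off rather than a fix.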

At this point, simply running glm() via train() produces 25 copies of the fully expanded model.frame plus the original data, all of which need to be held in memory during the resampling process; whether they are held concurrently or consecutively is not immediately clear from a quick look at the code, as the resampling happens in an lapply() call. There will also be 25 copies of the raw data.

Once the resampling is finished, the returned object will contain 2 copies of the raw data and a full copy of the model.frame. If your training data is large relative to available RAM or contains many factors to be expanded in the model.frame, then you could easily be using huge amounts of memory just carrying copies of the data around.

If you add model = FALSE to your train call, that might make a difference. Here is a small example using the clotting data in ?glm:

clotting <- data.frame(u = c(5,10,15,20,30,40,60,80,100),
                       lot1 = c(118,58,42,35,27,25,21,19,18),
                       lot2 = c(69,35,26,21,18,16,13,12,12))
require(caret)

then

> m1 <- train(lot1 ~ log(u), data=clotting, family = Gamma, method = "glm",
+             model = TRUE)
Fitting: parameter=none 
Aggregating results
Fitting model on full training set
> m2 <- train(lot1 ~ log(u), data=clotting, family = Gamma, method = "glm",
+             model = FALSE)
Fitting: parameter=none 
Aggregating results
Fitting model on full training set
> object.size(m1)
121832 bytes
> object.size(m2)
116456 bytes
> ## ordinary glm() call:
> m3 <- glm(lot1 ~ log(u), data=clotting, family = Gamma)
> object.size(m3)
47272 bytes
> m4 <- glm(lot1 ~ log(u), data=clotting, family = Gamma, model = FALSE)
> object.size(m4)
42152 bytes

So there is a size difference in the returned object, and memory use during training will be lower. How much lower will depend on whether the internals of train() keep all copies of the model.frame in memory during the resampling process.
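
If you want to see roughly where that size lives, you can poke at the usual components of the objects from the example above (these are the standard train/glm slots, but the exact numbers will vary with your data and caret version):

object.size(m1$trainingData)       ## copy of the data kept by train()
object.size(m1$finalModel$model)   ## glm()'s model frame, kept when model = TRUE
object.size(m1$finalModel$data)    ## the data stored on the "glm" fit itself
object.size(m2$finalModel$model)   ## NULL once model = FALSE is passed through

That should make it clearer which part of the footprint the model = FALSE argument actually removes.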

The object returned by train() is also significantly larger than that returned by glm(), as mentioned by @DWin in the comments below.

To take this further, either study the code more closely, or email Max Kuhn, the maintainer of caret, to enquire about options to reduce the memory footprint.

Answered Sep 19 '22 by Gavin Simpson