 

Exporting a caret R model with minimum information used to predict

Tags: r, r-caret

I would like to export the model below so that another user can open it and use the predict function to predict classes for new observations. That is the only thing it will be used for. I can save mod_fit, but it takes up a lot of space and the end user can access information which I don't want to share. Is there an easy way?

library(caret)
library(dplyr)

# Keep two species and store the class as character so glm fits a binary classifier
iris2 <- iris %>% filter(Species != "setosa") %>% mutate(Species = as.character(Species))
mod_fit <- train(Species ~ ., data = iris2, method = "glm")
Asked Dec 04 '17 by MLEN

1 Answer

The following is a generic procedure for trimming R objects of data that might not be necessary for the target use. It is heuristic in nature, but I have already applied it successfully twice, and with a bit of luck it works quite well.

You can measure object size using a function called object.size:

> object.size(mod_fit)
528616 bytes

Indeed, quite a lot for a linear model with four predictors. You can inspect what's inside the object using, for example, the str function:

> str(mod_fit)
List of 23
 $ method      : chr "glm"
 $ modelInfo   :List of 15
  ..$ label     : chr "Generalized Linear Model"
  ..$ library   : NULL
  ..$ loop      : NULL
  ..$ type      : chr [1:2] "Regression" "Classification"
  ..$ parameters:'data.frame':  1 obs. of  3 variables:
  .. ..$ parameter: Factor w/ 1 level "parameter": 1
  .. ..$ class    : Factor w/ 1 level "character": 1
[…]
 $ coefnames   : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
 $ xlevels     : Named list()
 - attr(*, "class")= chr [1:2] "train" "train.formula"

Quite a lot of data. So, let's check how much space each of these elements takes:

> sort(sapply(mod_fit, object.size))
        pred   preProcess      yLimits         dots     maximize       method 
           0            0            0           40           48           96 
   modelType       metric    perfNames      xlevels    coefnames       levels 
         104          104          160          192          296          328 
        call     bestTune      results        times     resample  resampledCM 
         936         1104         1584         2024         2912         4152 
trainingData        terms      control    modelInfo   finalModel 
        5256         6112        29864       211824       259456 

Now we can try removing elements from this object one by one, checking which are necessary for predict to work, starting from the largest:

> test_obj <- mod_fit; test_obj$finalModel <- NULL; predict(test_obj, iris2)
Error in if (modelFit$problemType == "Classification") { : 
  argument is of length zero

Whoops, finalModel seems important. Any kind of error here tells you that you can't remove that element. How about, let's say, control?

> test_obj <- mod_fit; test_obj$control <- NULL; predict(test_obj, iris2)
  [1] versicolor versicolor versicolor versicolor versicolor versicolor
  [7] versicolor versicolor versicolor versicolor versicolor versicolor
 [13] versicolor versicolor versicolor versicolor versicolor versicolor
[…]
 [97] virginica  virginica  virginica  virginica 
Levels: versicolor virginica
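
If you want to automate this drop-and-test loop, a sketch along the following lines should work; droppable() is just a hypothetical helper name, not part of caret, and it only flags elements whose removal leaves predict running and returning the same values as the full model:

droppable <- function(fit, newdata) {
  ref <- predict(fit, newdata)            # reference predictions from the full object
  sapply(names(fit), function(el) {
    test_obj <- fit
    test_obj[[el]] <- NULL                # drop one top-level element
    out <- try(predict(test_obj, newdata), silent = TRUE)
    # TRUE means this element could be dropped without changing the predictions
    !inherits(out, "try-error") && identical(out, ref)
  })
}

droppable(mod_fit, iris2)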

So, it seems that control is not needed. You can perform this process recursively, for example:

> sort(sapply(mod_fit$finalModel, object.size))
           offset         contrasts             param              rank 
                0                 0                40                48 
[…]
            model            family 
            17056            163936 
> sort(sapply(mod_fit$finalModel$family, object.size))
      link     family   valideta    linkfun    linkinv     mu.eta dev.resids 
        96        104        272        560        560        560       1992 
  variance    validmu initialize        aic   simulate 
      2064       6344      18712      27512     103888 
> test_obj <- mod_fit; test_obj$finalModel$family$simulate <- NULL; predict(test_obj, iris2)
  [1] versicolor versicolor versicolor versicolor versicolor versicolor
[…]
 [97] virginica  virginica  virginica  virginica 
Levels: versicolor virginica

With enough attempts you will know which parts of the object are necessary and which are not, and you can remove the unnecessary ones before storing the model.
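
As a sketch of that last step, assuming control and finalModel$family$simulate are the parts your own tests identified as removable (adjust the list to whatever you found for your model), you can strip them, verify the predictions, and save only the slimmed-down object:

slim_fit <- mod_fit
slim_fit$control <- NULL                       # shown removable above
slim_fit$finalModel$family$simulate <- NULL    # shown removable above

# Make sure the trimmed object still predicts exactly like the original
stopifnot(identical(predict(slim_fit, iris2), predict(mod_fit, iris2)))

object.size(slim_fit)               # should be noticeably smaller
saveRDS(slim_fit, "slim_fit.rds")   # hand this file over; load it with readRDS()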

Note: while this process may remove unnecessary parts of the object, you may also accidentally remove parts that are only sometimes used in prediction. For simple models that always work the same way, like glm, this should not happen, though.

Also, this process does not guarantee that the trimmed object leaks no information you want to keep from the model's user. There is no such guarantee in general: there are methods for reconstructing significant information about a model and its training data even from black-box models that are not normally easy to interpret.

Answered Nov 14 '22 by liori