Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

modelr: Fitting multiple models with resampled data

Tags:

r

dplyr

purrr

tidyr

In the tidy model of data science (TM) implemented in modelr, resampled data are organized using list-columns:

library(modelr)
library(tidyverse)

# create the k-folds
df_heights_resampled = heights %>% 
  crossv_kfold(k = 10, id = "Resample ID")

It is possible to map a model to each of the training datasets in the list-column train and to compute a performance metric by mapping onto the list-column test.

In case this needs to be done withe multiple models, this needs to be repeated for each of the models.

# create a list of formulas 
formulas_heights = formulas(
  .response = ~ income, 
  model1  = ~ height + weight + marital + sex,
  model2 = ~ height + weight + marital + sex + education
) 

# fit each of the models in the list of formulas
df_heights_resampled = df_heights_resampled %>% 
  mutate(
    model1 = map(train, function(train_data) {
      lm(formulas_heights[[1]], data = train_data)
    }),
    model2 = map(train, function(train_data) {
      lm(formulas_heights[[2]], data = train_data)
    })
  )

# score the models on the test sets
df_heights_resampled = df_heights_resampled %>% 
  mutate(
    rmse1 = map2_dbl(.x = model1, .y = test, .f = rmse),
    rmse2 = map2_dbl(.x = model2, .y = test, .f = rmse)
  )

which gives:

> df_heights_resampled
# A tibble: 10 × 7
            train           test `Resample ID`   model1   model2    rmse1    rmse2
           <list>         <list>         <chr>   <list>   <list>    <dbl>    <dbl>
1  <S3: resample> <S3: resample>            01 <S3: lm> <S3: lm> 58018.35 53903.99
2  <S3: resample> <S3: resample>            02 <S3: lm> <S3: lm> 55117.37 50279.38
3  <S3: resample> <S3: resample>            03 <S3: lm> <S3: lm> 49005.82 44613.93
4  <S3: resample> <S3: resample>            04 <S3: lm> <S3: lm> 55437.07 51068.90
5  <S3: resample> <S3: resample>            05 <S3: lm> <S3: lm> 48845.35 44673.88
6  <S3: resample> <S3: resample>            06 <S3: lm> <S3: lm> 58226.69 54010.50
7  <S3: resample> <S3: resample>            07 <S3: lm> <S3: lm> 56571.93 53322.41
8  <S3: resample> <S3: resample>            08 <S3: lm> <S3: lm> 46084.82 42294.50
9  <S3: resample> <S3: resample>            09 <S3: lm> <S3: lm> 59762.22 54814.55
10 <S3: resample> <S3: resample>            10 <S3: lm> <S3: lm> 45328.48 41882.79

Question:

This can get cumbersome really fast if the number of models to be explored is large. modelr provides the fit_with function that allows to iterate over a number of models (as characterized by multiple formulae) but that does not seem to allow for a list-column like train in the model above. I am assuming that one of the *map* family of functions will make this possible (invoke_map?), but have not been able to figure out how.

like image 774
tchakravarty Avatar asked Oct 23 '16 15:10

tchakravarty


1 Answers

You can programmatically build the calls using map and lazyeval::interp. I'm curious if there is a pure purrr solution, but the issue is that you want to create multiple columns, and you need multiple calls for that. Perhaps a purrr solution would create another list column containing all models.

library(lazyeval)
model_calls <- map(formulas_heights, 
                   ~interp(~map(train, ~lm(form, data = .x)), form = .x))
score_calls <- map(names(model_calls), 
                   ~interp(~map2_dbl(.x = m, .y = test, .f = rmse), m = as.name(.x)))
names(score_calls) <- paste0("rmse", seq_along(score_calls))

df_heights_resampled %>% mutate_(.dots = c(model_calls, score_calls))
# A tibble: 10 × 7
            train           test `Resample ID`   model1   model2    rmse1    rmse2
           <list>         <list>         <chr>   <list>   <list>    <dbl>    <dbl>
1  <S3: resample> <S3: resample>            01 <S3: lm> <S3: lm> 44720.86 41452.07
2  <S3: resample> <S3: resample>            02 <S3: lm> <S3: lm> 54174.38 48823.03
3  <S3: resample> <S3: resample>            03 <S3: lm> <S3: lm> 56854.21 52725.62
4  <S3: resample> <S3: resample>            04 <S3: lm> <S3: lm> 53312.38 48797.48
5  <S3: resample> <S3: resample>            05 <S3: lm> <S3: lm> 61883.90 57469.17
6  <S3: resample> <S3: resample>            06 <S3: lm> <S3: lm> 55709.83 50867.26
7  <S3: resample> <S3: resample>            07 <S3: lm> <S3: lm> 53036.06 48698.07
8  <S3: resample> <S3: resample>            08 <S3: lm> <S3: lm> 55986.83 52717.94
9  <S3: resample> <S3: resample>            09 <S3: lm> <S3: lm> 51738.60 48006.74
10 <S3: resample> <S3: resample>            10 <S3: lm> <S3: lm> 45061.22 41480.35
like image 57
Axeman Avatar answered Oct 03 '22 10:10

Axeman