Smooth every group via `do`

Tags: r, dplyr

I have some data, a sample of which is below. My goal is to fit a gam to each Year, and to add another column holding the predicted values from that Year's gam model.

fertility <- structure(list(AGE = c(15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 
23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 
36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 15L, 16L, 17L, 18L, 
19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 
32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L
), Year = c(1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 
1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 
1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1930, 1931, 
1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 
1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 1931, 
1931, 1931, 1931, 1931, 1931, 1931, 1931), fertility = c(5.170284269, 
14.18135114, 27.69795144, 44.61216712, 59.08896308, 89.66036496, 
105.4563852, 120.1754041, 137.4074262, 148.7159407, 161.5645606, 
157.200515, 143.6340251, 127.8855125, 117.7343628, 159.2909484, 
126.6158821, 109.0681613, 86.98223678, 70.64470361, 111.0070633, 
86.15051988, 68.9204159, 55.92722274, 42.93402958, 56.84376018, 
39.35337243, 26.72142573, 18.46207596, 9.231037978, 4.769704534, 
13.08261815, 25.55198857, 41.15573626, 54.51090896, 81.99522459, 
96.44082973, 109.9015072, 125.6603492, 136.0020892, 148.679958, 
144.6639404, 132.1793638, 117.6867783, 108.345172, 144.2820726, 
114.68575, 98.79142865, 78.7865069, 63.9883456, 100.217918, 77.77726461, 
62.22181169, 50.49147014, 38.76112859, 52.48807067, 36.33789508, 
24.67387938, 17.04740757, 8.523703784)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -60L), .Names = c("AGE", 
"Year", "fertility"))

So, the non-dplyr, "dumb" way to do it would be

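library(dplyr)  # for filter() and mutate()

pred <- numeric(nrow(fertility))  # preallocate one prediction per row
count <- 0
for (i in 1930:1931){
  count <- count + 1
  temp <- filter(fertility, Year == i)
  mod <- mgcv::gam(fertility ~ s(AGE), data=temp)
  # fill the block of 30 rows that belongs to this Year
  pred[length(15:44) * (count - 1) + 1:30] <- predict(mod, newdata = data.frame(AGE = 15:44))
}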

fertility1 <- mutate(fertility, pred = pred)

But I'd like a method in dplyr. My thought was to use do to create a model for each Year, then use predict to obtain the values. The first step I can do, but I'm struggling to implement the second part in dplyr:

library(mgcv)
library(dplyr)

  fertility %>%
    #filter(!is.na(fertility)) %>%  # not sure if this is necessary
    group_by(Year) %>%
    dplyr::do(model = mgcv::gam(fertility ~ s(AGE), data = .)) %>%
    left_join(fertility, .) %>%
    mutate(smoothed = predict(model, newdata = AGE))

I get the error message

Error in UseMethod("predict") : 
  no applicable method for 'predict' applied to an object of class "list"

which presumably means dplyr doesn't remember that model is a model, rather than just a list element.
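For reference, the error arises because do() stores the fitted models in a list-column, so after the join predict() is handed the whole list rather than one gam object (and newdata would need to be a data frame, not the bare AGE vector). One way around it, sketched here under some assumptions, is to fit and predict inside each group with group_modify(). The `toy` data frame below is a hypothetical stand-in with the same columns as `fertility`, used only to keep the sketch self-contained:

```r
library(mgcv)
library(dplyr)

# Hypothetical stand-in for `fertility` (same columns, made-up values)
toy <- data.frame(AGE  = rep(15:44, times = 2),
                  Year = rep(c(1930, 1931), each = 30))
toy$fertility <- (150 * exp(-((toy$AGE - 27) / 8)^2) + 5 * sin(toy$AGE)) *
  ifelse(toy$Year == 1930, 1, 0.93)

# Fit one gam per Year and attach its fitted values to that Year's rows
smoothed <- toy %>%
  group_by(Year) %>%
  group_modify(~ {
    mod <- mgcv::gam(fertility ~ s(AGE), data = .x)
    # predict() without newdata returns fitted values for the rows in .x
    mutate(.x, smoothed = as.numeric(predict(mod)))
  }) %>%
  ungroup()

head(smoothed)
```

This keeps one separate model per Year, as in the loop version; group_modify() requires a reasonably recent dplyr (0.8.1 or later).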

Hugh asked Dec 01 '22 17:12



1 Answer

The smart way to do this would be to use factor-smooth interactions that have been available in mgcv for ages, either via by terms in s() or via the newer bs = "fs" basis type. Here is an example with your data:

library("mgcv")
## Make Year a factor
fertility <- transform(fertility, Year = factor(Year))
## Fit model using by terms - include factor as fixed effect too!
mod <- gam(fertility ~ Year + s(AGE, by = Year), data = fertility)
## Plot to see what form this model takes
plot(mod, pages = 1)

## Some prediction data
ages <- with(fertility, seq(min(AGE), max(AGE)))
## Need to replicate this once per Year
pdat <- with(fertility,
             data.frame(AGE = rep(ages, nlevels(Year)),
                        Year = rep(levels(Year), each = length(ages))))
## Add the fitted values to the prediction data
pdat <- transform(pdat, fitted = predict(mod, newdata = pdat))
head(pdat)

> head(pdat)
  AGE Year     fitted
1  15 1930 -0.8496705
2  16 1930 15.9568574
3  17 1930 33.0754019
4  18 1930 50.7419122
5  19 1930 68.9116594
6  20 1930 87.1306489

However, you can just ask for the fitted values if all you want to do is predict for the observed values of AGE:

fertility <- transform(fertility, fitted = predict(mod))
head(fertility)

> head(fertility)
  AGE Year fertility     fitted
1  15 1930  5.170284 -0.8496705
2  16 1930 14.181351 15.9568574
3  17 1930 27.697951 33.0754019
4  18 1930 44.612167 50.7419122
5  19 1930 59.088963 68.9116594
6  20 1930 89.660365 87.1306489

You might also look at the specific factor-smooth basis type bs = "fs"; see ?smooth.terms and ?factor.smooth.interaction for details. Basically, these are efficient if you have a lot of levels, but each level's smoother then shares the same value of the smoothing parameter.
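As a rough sketch of what the bs = "fs" form might look like (the `toy` data frame is a hypothetical stand-in for `fertility` with Year already converted to a factor, and method = "REML" is my choice here rather than something the data require):

```r
library(mgcv)

# Hypothetical stand-in for `fertility`, with Year as a factor
toy <- data.frame(AGE  = rep(15:44, times = 2),
                  Year = factor(rep(c(1930, 1931), each = 30)))
toy$fertility <- (150 * exp(-((toy$AGE - 27) / 8)^2) + 5 * sin(toy$AGE)) *
  ifelse(toy$Year == "1930", 1, 0.93)

# One smooth of AGE per level of Year, all sharing a smoothing parameter
mod_fs <- gam(fertility ~ s(AGE, Year, bs = "fs"),
              data = toy, method = "REML")

# Fitted values at the observed AGE / Year combinations
toy$fitted_fs <- as.numeric(predict(mod_fs))
head(toy)
```

With only two years the by-term model above is probably the more natural choice; the "fs" basis pays off mainly when there are many levels.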

The main advantage here is that you use all your data and fit a single model, which you can then interrogate in a number of ways not easily open to you if you fit m separate models, such as being able to investigate differences in the smoothers per year.
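As one simple illustration of interrogating the single model, the sketch below predicts both years over the same ages and takes the pointwise difference. The `toy` data frame is again a hypothetical stand-in for `fertility`, and the difference comes without any uncertainty estimate, so treat it as a starting point only:

```r
library(mgcv)

# Hypothetical stand-in for `fertility`, with Year as a factor
toy <- data.frame(AGE  = rep(15:44, times = 2),
                  Year = factor(rep(c(1930, 1931), each = 30)))
toy$fertility <- (150 * exp(-((toy$AGE - 27) / 8)^2) + 5 * sin(toy$AGE)) *
  ifelse(toy$Year == "1930", 1, 0.93)

# Single model with a separate smooth of AGE for each Year
mod <- gam(fertility ~ Year + s(AGE, by = Year), data = toy)

# Predict each Year over the same grid of ages
ages  <- 15:44
p1930 <- predict(mod, newdata = data.frame(
  AGE = ages, Year = factor("1930", levels = levels(toy$Year))))
p1931 <- predict(mod, newdata = data.frame(
  AGE = ages, Year = factor("1931", levels = levels(toy$Year))))

# Pointwise difference between the two fitted curves
year_diff <- data.frame(AGE = ages, diff = as.numeric(p1931 - p1930))
head(year_diff)
```

With separately fitted models the same comparison is possible, but a single model keeps everything on one footing and leaves the door open to proper inference on the difference.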

Gavin Simpson answered Dec 04 '22 12:12
