Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

stat_smooth gam not the same as gam {mgcv}

I was using the stat_smooth function in ggplot2, decided I wanted the "goodness of fit", and used mgvc gam for that. It occurred to me that I should check to make sure that they were the same model (stat_smooth vs mgvc gam), so I used the code below to check. Seemingly, they have different results, as evidenced by the plot (Plot: stat_smoother gam (red), mgcv gam (black)). However, I don't know why they have different results. Is it that some default parameter is different between the two? Is is that gam is being run on a numeric x and stat_smooth is being run with a POSIXct x (if so - I don't know what to do about that)? It looks like stat_smooth is smoother, but the k values are the same...

I think there are several posts on how to plot gam outputs in ggplot2, but I'd really like to know why stat_smooth and mgcv are giving different results in the first place. I am very new to GAM (and R), so it's quite possible I'm missing something easy. However, I did google and search this forum before asking.

My data is a bit big to easily share, so I used a sample dataset - I've put the source in the code, as well as a dput() below everything, and my sessionInfo() after that.

I have tried to make a quality question, but it is only my second one. Ever. So, constructive criticism is appreciated.

Thank you!

library(readxl)
library(data.table)
library(ggplot2)
library(scales)
library(mgcv)

stackOF_data <- read_excel("mean-daily-flow-cumecs-vatnsdals.xlsx", sheet = "Data")
stackOF_data <- data.table(stackOF_data)
stackOF_data <- stackOF_data[,.(timeseries=as.POSIXct(Date,format("%Y-%m-%d")),mdf)]

a <- stackOF_data[,.(x=as.numeric(timeseries),y=mdf)]
a1 <- gam(y~s(x, k=100, bs="cs"),data=a) 
a2=data.table(gam_mdf= predict(a1,a))
a2=cbind(timeseries=stackOF_data$timeseries,a2)

# see if predict and actual are the same
p <- ggplot() + 
geom_line(data = a2, aes(x = timeseries, y = gam_mdf), size=1)+
scale_color_manual(values=c("black","magenta"))+
scale_y_continuous()+
scale_x_datetime(labels = date_format("%Y-%m-%d"), breaks = "1 month", minor_breaks = "1 week")+ 
theme(axis.text.x=element_text(angle=50, size=10,hjust=1))+
stat_smooth(data = stackOF_data, aes(x = (timeseries), y = mdf),method="gam", formula=y~s(x,k=100, bs="cs"), col="red", se=FALSE, size=1)
p

# data from: https://datamarket.com/data/set/235m/mean-daily-flow-cumecs-vatnsdalsa-river-1-jan-1972-31-dec-1974#!ds=235m&display=line&s=14l

> dput(a)
structure(list(x = c(126230400, 126316800, 126403200, 126489600, 
126576000, 126662400, 126748800, 126835200, 126921600, 127008000, 
127094400, 127180800, 127267200, 127353600, 127440000, 127526400, 
127612800, 127699200, 127785600, 127872000, 127958400, 128044800, 
128131200, 128217600, 128304000, 128390400, 128476800, 128563200, 
128649600, 128736000, 128822400, 128908800, 128995200, 129081600, 
129168000, 129254400, 129340800, 129427200, 129513600, 129600000, 
129686400, 129772800, 129859200, 129945600, 130032000, 130118400, 
130204800, 130291200, 130377600, 130464000, 130550400, 130636800, 
130723200, 130809600, 130896000, 130982400, 131068800, 131155200, 
131241600, 131328000, 131414400, 131500800, 131587200, 131673600, 
131760000, 131846400, 131932800, 132019200, 132105600, 132192000, 
132278400, 132364800, 132451200, 132537600, 132624000, 132710400, 
132796800, 132883200, 132969600, 133056000, 133142400, 133228800, 
133315200, 133401600, 133488000, 133574400, 133660800, 133747200, 
133833600, 133920000, 134006400, 134092800, 134179200, 134265600, 
134352000, 134438400, 134524800, 134611200, 134697600, 134784000, 
134870400, 134956800, 135043200, 135129600, 135216000, 135302400, 
135388800, 135475200, 135561600, 135648000, 135734400, 135820800, 
135907200, 135993600, 136080000, 136166400, 136252800, 136339200, 
136425600, 136512000, 136598400, 136684800, 136771200, 136857600, 
136944000, 137030400, 137116800, 137203200, 137289600, 137376000, 
137462400, 137548800, 137635200, 137721600, 137808000, 137894400, 
137980800, 138067200, 138153600, 138240000, 138326400, 138412800, 
138499200, 138585600, 138672000, 138758400, 138844800, 138931200, 
139017600, 139104000, 139190400, 139276800, 139363200, 139449600, 
139536000, 139622400, 139708800, 139795200, 139881600, 139968000, 
140054400, 140140800, 140227200, 140313600, 140400000, 140486400, 
140572800, 140659200, 140745600, 140832000, 140918400, 141004800, 
141091200, 141177600, 141264000, 141350400, 141436800, 141523200, 
141609600, 141696000, 141782400, 141868800, 141955200, 142041600, 
142128000, 142214400, 142300800, 142387200, 142473600, 142560000, 
142646400, 142732800, 142819200, 142905600, 142992000, 143078400, 
143164800, 143251200, 143337600, 143424000, 143510400, 143596800, 
143683200, 143769600, 143856000, 143942400, 144028800, 144115200, 
144201600, 144288000, 144374400, 144460800, 144547200, 144633600, 
144720000, 144806400, 144892800, 144979200, 145065600, 145152000, 
145238400, 145324800, 145411200, 145497600, 145584000, 145670400, 
145756800, 145843200, 145929600, 146016000, 146102400, 146188800, 
146275200, 146361600, 146448000, 146534400, 146620800, 146707200, 
146793600, 146880000, 146966400, 147052800, 147139200, 147225600, 
147312000, 147398400, 147484800, 147571200, 147657600, 147744000, 
147830400, 147916800, 148003200, 148089600, 148176000, 148262400, 
148348800, 148435200, 148521600, 148608000, 148694400, 148780800, 
148867200, 148953600, 149040000, 149126400, 149212800, 149299200, 
149385600, 149472000, 149558400, 149644800, 149731200, 149817600, 
149904000, 149990400, 150076800, 150163200, 150249600, 150336000, 
150422400, 150508800, 150595200, 150681600, 150768000, 150854400, 
150940800, 151027200, 151113600, 151200000, 151286400, 151372800, 
151459200, 151545600, 151632000, 151718400, 151804800, 151891200, 
151977600, 152064000, 152150400, 152236800, 152323200, 152409600, 
152496000, 152582400, 152668800, 152755200, 152841600, 152928000, 
153014400, 153100800, 153187200, 153273600, 153360000, 153446400, 
153532800, 153619200, 153705600, 153792000, 153878400, 153964800, 
154051200, 154137600, 154224000, 154310400, 154396800, 154483200, 
154569600, 154656000, 154742400, 154828800, 154915200, 155001600, 
155088000, 155174400, 155260800, 155347200, 155433600, 155520000, 
155606400, 155692800, 155779200, 155865600, 155952000, 156038400, 
156124800, 156211200, 156297600, 156384000, 156470400, 156556800, 
156643200, 156729600, 156816000, 156902400, 156988800, 157075200, 
157161600, 157248000, 157334400, 157420800, 157507200, 157593600, 
157680000), y = c(4.65, 4.65, 4.65, 4.48, 5.16, 5.52, 5.34, 5.34, 
4.82, 4.65, 4.48, 4.31, 4.31, 4.31, 4.14, 3.82, 3.98, 3.98, 4.31, 
5.71, 6.5, 6.3, 5.71, 5.71, 5.16, 4.65, 4.14, 3.98, 4.48, 4.48, 
4.31, 4.65, 4.31, 3.98, 3.98, 3.98, 3.98, 3.98, 3.98, 3.82, 3.67, 
3.67, 3.98, 3.98, 3.82, 3.82, 3.82, 4.14, 5.9, 4.48, 3.98, 3.98, 
3.82, 3.67, 3.67, 3.67, 4.65, 3.98, 4.31, 4.31, 3.67, 4.31, 6.1, 
7.3, 7.5, 7.5, 8.14, 10.8, 16.1, 14.8, 12.5, 9.9, 8.14, 6.9, 
6.1, 5.34, 5.16, 4.99, 4.99, 4.99, 4.99, 5.52, 6.3, 7.3, 6.9, 
5.9, 5.71, 5.71, 8.58, 31.5, 33.7, 18.4, 11.3, 16.1, 32.9, 45.3, 
54, 25.7, 18, 15.9, 15.6, 14.5, 15.9, 35.9, 37.5, 29.4, 27.5, 
30.1, 27.5, 30.8, 29.4, 22, 20.1, 35.9, 36.7, 32.9, 22, 18, 15.9, 
15.2, 14.8, 13, 12.7, 12.5, 11, 9.68, 8.8, 7.92, 7.3, 6.9, 7.3, 
10.3, 11, 11.3, 11.9, 12.5, 13.6, 12.2, 10.8, 9.9, 9.46, 8.8, 
7.5, 7.1, 7.71, 7.1, 6.1, 5.34, 5.34, 5.34, 5.52, 5.52, 6.3, 
6.7, 6.5, 5.9, 5.71, 5.9, 5.71, 5.52, 7.3, 7.5, 7.1, 7.3, 6.7, 
6.9, 7.3, 7.5, 10.8, 11.6, 8.58, 7.92, 7.1, 6.7, 6.5, 6.1, 5.9, 
5.9, 5.71, 5.52, 5.52, 5.52, 5.9, 5.9, 5.71, 5.52, 5.52, 5.34, 
5.34, 5.52, 6.5, 6.5, 5.71, 5.34, 5.16, 4.99, 4.82, 4.82, 4.99, 
4.82, 4.82, 4.82, 4.82, 4.82, 4.65, 4.48, 4.48, 4.31, 4.31, 4.14, 
4.14, 4.31, 4.48, 4.31, 4.31, 4.31, 4.99, 5.71, 6.3, 6.1, 6.1, 
5.9, 5.71, 5.52, 5.52, 5.52, 5.52, 5.52, 5.34, 5.34, 5.52, 5.52, 
5.52, 5.34, 5.34, 5.52, 5.34, 5.52, 5.52, 5.34, 5.34, 5.34, 5.34, 
5.71, 5.9, 6.3, 6.9, 7.5, 6.5, 6.1, 6.1, 5.9, 6.1, 6.1, 5.9, 
6.5, 6.5, 6.1, 5.9, 5.9, 5.71, 5.9, 5.9, 5.71, 4.99, 4.65, 5.16, 
5.34, 5.34, 4.65, 4.99, 5.71, 5.34, 5.34, 5.34, 5.34, 4.99, 5.34, 
5.34, 5.34, 5.34, 5.52, 5.34, 5.52, 5.71, 6.5, 7.71, 6.9, 6.5, 
6.7, 6.1, 5.9, 6.1, 5.9, 5.71, 7.92, 7.71, 7.1, 7.92, 5.34, 5.16, 
8.14, 10.1, 7.92, 7.3, 6.9, 6.9, 6.9, 8.58, 7.3, 6.9, 7.3, 6.3, 
5.16, 6.1, 5.52, 4.99, 5.34, 5.34, 5.34, 5.16, 5.71, 5.52, 5.52, 
5.16, 4.82, 5.52, 6.1, 5.9, 5.71, 5.52, 5.16, 4.99, 4.48, 4.82, 
5.16, 5.16, 5.16, 5.16, 5.16, 4.82, 4.65, 3.82, 4.14, 4.65, 4.65, 
4.31, 4.31, 5.16, 5.16, 5.16, 5.16, 5.16, 4.99, 4.65, 5.16, 5.16, 
5.16, 5.16, 5.16, 5.16, 5.16, 5.16, 5.34, 5.34)), .Names = c("x", 
"y"), row.names = c(NA, -365L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x0000000005860788>)

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United         States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.6 readxl_0.1.0     mgcv_1.8-7       nlme_3.1-121         scales_0.3.0     sos_1.3-8        brew_1.0-6       ggplot2_1.0.1   
[9] MASS_7.3-43     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.1      lattice_0.20-33  digest_0.6.8     chron_2.3-47     grid_3.2.2       plyr_1.8.3       gtable_0.1.2     magrittr_1.5    
 [9] stringi_0.5-5    reshape2_1.4.1   Matrix_1.2-2     labeling_0.3     proto_0.3-10     tools_3.2.2      stringr_1.0.0    munsell_0.4.2   
[17] colorspace_1.2-6

Partial Solution

I still don't really know why the two methods are giving different answers, and that bothers me. However, after much internet searching, I did find the following workaround:

library(readxl)
library(data.table)
library(ggplot2)
library(scales)
library(mgcv)

stackOF_data <- read_excel("C:/Users/jel4049/Desktop/mean-daily-flow-cumecs-    vatnsdals.xlsx", sheet = "Data")
stackOF_data <- data.table(stackOF_data)

stackOF_data <- stackOF_data[,.(timeseries=as.POSIXct(Date,format("%Y-%m-%d")),mdf)]
a <- stackOF_data[,.(x=as.numeric(timeseries),y=mdf)]
a1 <- gam(y~s(x, k=100, bs="cs"),data=a) 
a2=data.table(gam_mdf = predict(a1,a))

preds <- predict(a1,se.fit=TRUE)
my_data <- data.frame(mu=preds$fit, low =(preds$fit - 1.96 * preds$se.fit), high = (preds$fit + 1.96 * preds$se.fit))


m <- ggplot()+
  geom_line(data = a2, aes(x=stackOF_data$timeseries, y=gam_mdf), size=1, col="blue")+
  geom_smooth(data=my_data,aes(ymin = low, ymax = high, x=stackOF_data$timeseries, y = mu), stat = "identity", col="green")
m

Now at least I know that the summary and data fit quality info I can get from some of the mgcv functions match my plots.

like image 810
CJ9 Avatar asked Oct 09 '15 14:10

CJ9


1 Answers

It turns out the differences you were seeing was because you were using the default for the n argument in stat_smooth.

From the help page:

n number of points to evaluate smoother at

Of course, it didn't jump out at me right away that this meant n controls the size of the dataset for the newdata argument in predict and therefore stat_smooth doesn't use the original dataset when making the predictions. But I was reading this nice answer on a different stat_smooth question and realized that to figure out what was going on I should take a closer look at the stat_smooth predictions vs manual predictions from a fitted gam model.

So, using your dataset from your OP, which I named dat, we can check what's going on.

The plot when k = 100, after fitting the model via gam and adding the predictions to the dataset. As you noted, the blue (stat_smooth) and black (manual predictions) don't match.

dat$predgam = predict(gam(y ~ s(x, k = 100), data = dat))

(p1 = ggplot(dat, aes(x, y)) +
    geom_point() +
    geom_smooth(method = "gam", formula = y ~ s(x, k = 100)) +
    geom_line(aes(y = predgam)))

enter image description here

You can always use ggplot_build to look at your plot object and see all the pieces that make it up (I'm not showing the results here because it takes up so much space, but the output will print to your Console).

ggplot_build(p1)

The prediction dataset for stat_smooth is the second in the list of datasets.

ggplot_build(p1)$data[[2]]

But look how many rows that dataset has:

nrow(ggplot_build(p1)$data[[2]])
[1] 80

The default setting for the n argument is 80, but you have 365 rows in your dataset. So what happens if you change n to 365? I'll make the smooth line fatter so you can actually see it (in blue).

(p2 = ggplot(dat, aes(x, y)) +
    geom_point() +
    geom_smooth(method = "gam", formula = y ~ s(x, k = 100), n = 365, size = 2) +
    geom_line(aes(y = predgam)))

enter image description here

nrow(ggplot_build(p2)$data[[2]])
[1] 365

If you look at the code for the predictdf function mentioned in the Details section of the stat_smooth help page you'll see that the original dataset isn't used when making predictions. Instead, a sequence is made from the original explanatory variable. This is something that can be really important when working with a small dataset and you need more prediction points in order for the line to look smooth. In your case, though, the original dataset is already a nice smooth sequence of x so using n = 365 gets the same predictions from stat_smooth as the original dataset does.

You can see the code for predictdf here.

like image 169
aosmith Avatar answered Oct 22 '22 20:10

aosmith