I am converting a SAS PROC GENMOD example into R, using glm in R. The SAS code was:
proc genmod data=data0 namelen=30;
model boxcoxy=boxcoxxy ~ AGEGRP4 + AGEGRP5 + AGEGRP6 + AGEGRP7 + AGEGRP8 + RACE1 + RACE3 + WEEKEND +
SEQ/dist=normal;
FREQ REPLICATE_VAR;
run;
My R code is:
parmsg2 <- glm(boxcoxxy ~ AGEGRP4 + AGEGRP5 + AGEGRP6 + AGEGRP7 + AGEGRP8 + RACE1 + RACE3 + WEEKEND +
SEQ , data=data0, family=gaussian, weights = REPLICATE_VAR)
When I use summary(parmsg2)
I get the same coefficient estimates as in SAS, but my standard errors are wildly different.
The summary output from SAS is:
Name df Estimate StdErr LowerWaldCL UpperWaldCL ChiSq ProbChiSq
Intercept 1 6.5007436 .00078884 6.4991975 6.5022897 67911982 0
agegrp4 1 .64607262 .00105425 .64400633 .64813891 375556.79 0
agegrp5 1 .4191395 .00089722 .41738099 .42089802 218233.76 0
agegrp6 1 -.22518765 .00083118 -.22681672 -.22355857 73401.113 0
agegrp7 1 -1.7445189 .00087569 -1.7462352 -1.7428026 3968762.2 0
agegrp8 1 -2.2908855 .00109766 -2.2930369 -2.2887342 4355849.4 0
race1 1 -.13454883 .00080672 -.13612997 -.13296769 27817.29 0
race3 1 -.20607036 .00070966 -.20746127 -.20467944 84319.131 0
weekend 1 .0327884 .00044731 .0319117 .03366511 5373.1931 0
seq2 1 -.47509583 .00047337 -.47602363 -.47416804 1007291.3 0
Scale 1 2.9328613 .00015586 2.9325559 2.9331668 -127
The summary output from R is:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.50074 0.10354 62.785 < 2e-16
AGEGRP4 0.64607 0.13838 4.669 3.07e-06
AGEGRP5 0.41914 0.11776 3.559 0.000374
AGEGRP6 -0.22519 0.10910 -2.064 0.039031
AGEGRP7 -1.74452 0.11494 -15.178 < 2e-16
AGEGRP8 -2.29089 0.14407 -15.901 < 2e-16
RACE1 -0.13455 0.10589 -1.271 0.203865
RACE3 -0.20607 0.09315 -2.212 0.026967
WEEKEND 0.03279 0.05871 0.558 0.576535
SEQ -0.47510 0.06213 -7.646 2.25e-14
The importance of the difference in the standard errors is that the SAS coefficients are all statistically significant, but the RACE1
and WEEKEND
coefficients in the R output are not. I have found a formula to calculate the Wald confidence intervals in R, but this is pointless given the difference in the standard errors, as I will not get the same results.
Apparently SAS uses a ridge-stabilized Newton-Raphson algorithm for its estimates, which are ML. The information I read about the glm
function in R is that the results should be equivalent to ML. What can I do to change my estimation procedure in R so that I get the equivalent coefficents and standard error estimates that were produced in SAS?
To update, thanks to Spacedman's answer, I used weights because the data are from individuals in a dietary survey, and REPLICATE_VAR
is a balanced repeated replication weight, that is an integer (and quite large, in the order of 1000s or 10000s). The website that describes the weight is here. I don't know why the FREQ
rather than the WEIGHT
command was used in SAS. I will now test by expanding the number of observations using REPLICATE_VAR and rerunning the analysis.
Thanks to Ben's answer below, the code I am using now is:
parmsg2 <- coef(summary(glm(boxcoxxy ~ AGEGRP4 + AGEGRP5 + AGEGRP6 + AGEGRP7 + AGEGRP8 + RACE1 + RACE3
+ WEEKEND + SEQ , data=data0, family=gaussian, weights = REPLICATE_VAR)))
#clean up the standard errors
parmsg2[,"Std. Error"] <- parmsg2[,"Std. Error"]/sqrt(mean(data0$REPLICATE_VAR))
parmsg2[,"t value"] <- parmsg2[,"Estimate"]/parmsg2[,"Std. Error"]
#note: using the t-distribution for p-values, correct the t-values
allsummary <- summary.glm(glm(boxcoxxy ~ AGEGRP4 + AGEGRP5 + AGEGRP6 + AGEGRP7 + AGEGRP8 + RACE1 +
RACE3 + WEEKEND + SEQ , data=data0, family=gaussian, weights = REPLICATE_VAR))
parmsg2[,"Pr(>|t|)"] <- 2*pt(-abs(parmsg2[,"t value"]),df=allsummary$df.resid)
The most different thing between GLM and GENMOD is estimating method is different. GLM is OLS, while GENMOD is MLE .
The GENMOD procedure enables you to perform exact logistic regression, also called exact conditional. binary logistic regression, and exact Poisson regression, also called exact conditional Poisson regression, by. specifying one or more EXACT statements.
FREQ in SAS is not the same as weights in R's glm. In SAS, its the number of occurrences of that event. For R, its "that each response y_i is the mean of w_i unit-weight observations". These two things are not the same.
If you want R to give the same output as SAS (can't think why) then you probably need to repeat each row in your data frame 'weight' number of times.
Here, data is 10 rows with all weights=2, and data2 is 20 rows (2 copies of each row of data) with all weights=1:
> summary(glm(y~x,data=data2,weights=weights))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.32859847 0.13413683 2.4497259 0.02475748
x 0.01540002 0.02161811 0.7123667 0.48537003
> summary(glm(y~x,data=data,weights=weights))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.32859847 0.20120525 1.6331506 0.1410799
x 0.01540002 0.03242716 0.4749111 0.6475449
Handwaving a bit, N observations of the same value have less fuzziness than saying this observation is the mean of N observations, so the SE with the repeated observations will have a smaller SE than the mean one.
edit: reading the SAS documentation for FREQ and your responses above and below, here's what I think you should try: use weights=REPLICATE_VAR
in the glm
statement to adjust the relative weighting of the groups (the equality of coefficients you found above suggests that this is the right way to go), then use N=sum(REPLICATE_VAR)
in the adjustment suggested below (I also think you could be using lm
instead of glm
for this problem ... it won't make much difference but should be a tiny bit faster and more robust.)
Something like:
s <- coef(summary(lm(y~x,data=data2, weights=REPLICATE_VAR)))
s[,"Std. Error"] <- s[,"Std. Error"]/sqrt(sum(data2$REPLICATE_VAR))
s[,"t value"] <- s[,"Estimate"]/s[,"Std. Error"]
s[,"Pr(>|t|)"] <- 2*pt(abs(s[,"t value"]),df=g$df.resid)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With