R glm standard error estimate differences to SAS PROC GENMOD

Tags:

I am converting a SAS PROC GENMOD example into R, using glm in R. The SAS code was:

proc genmod data=data0 namelen=30;
model boxcoxy=boxcoxxy ~ AGEGRP4 + AGEGRP5 + AGEGRP6 + AGEGRP7 + AGEGRP8 + RACE1 + RACE3 + WEEKEND + 
SEQ/dist=normal;
FREQ REPLICATE_VAR;  
run;

My R code is:

parmsg2 <- glm(boxcoxxy ~ AGEGRP4 + AGEGRP5 + AGEGRP6 + AGEGRP7 + AGEGRP8 + RACE1 + RACE3 + WEEKEND + 
SEQ , data=data0, family=gaussian, weights = REPLICATE_VAR)

When I use summary(parmsg2) I get the same coefficient estimates as in SAS, but my standard errors are wildly different.

The summary output from SAS is:

Name         df   Estimate      StdErr    LowerWaldCL  UpperWaldCL      ChiSq   ProbChiSq
Intercept    1   6.5007436    .00078884      6.4991975    6.5022897    67911982 0
agegrp4      1   .64607262    .00105425      .64400633    .64813891   375556.79 0
agegrp5      1    .4191395    .00089722      .41738099    .42089802   218233.76 0
agegrp6      1  -.22518765    .00083118     -.22681672   -.22355857   73401.113 0
agegrp7      1  -1.7445189    .00087569     -1.7462352   -1.7428026   3968762.2 0
agegrp8      1  -2.2908855    .00109766     -2.2930369   -2.2887342   4355849.4 0
race1        1  -.13454883    .00080672     -.13612997   -.13296769    27817.29 0
race3        1  -.20607036    .00070966     -.20746127   -.20467944   84319.131 0
weekend      1    .0327884    .00044731       .0319117    .03366511   5373.1931 0
seq2          1 -.47509583    .00047337     -.47602363   -.47416804   1007291.3 0
Scale         1 2.9328613     .00015586      2.9325559    2.9331668     -127

The summary output from R is:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.50074    0.10354  62.785  < 2e-16 
AGEGRP4      0.64607    0.13838   4.669 3.07e-06 
AGEGRP5      0.41914    0.11776   3.559 0.000374 
AGEGRP6     -0.22519    0.10910  -2.064 0.039031  
AGEGRP7     -1.74452    0.11494 -15.178  < 2e-16
AGEGRP8     -2.29089    0.14407 -15.901  < 2e-16
RACE1       -0.13455    0.10589  -1.271 0.203865    
RACE3       -0.20607    0.09315  -2.212 0.026967 
WEEKEND      0.03279    0.05871   0.558 0.576535 
SEQ         -0.47510    0.06213  -7.646 2.25e-14

The importance of the difference in the standard errors is that the SAS coefficients are all statistically significant, but the RACE1 and WEEKEND coefficients in the R output are not. I have found a formula to calculate the Wald confidence intervals in R, but this is pointless given the difference in the standard errors, as I will not get the same results.

Apparently SAS uses a ridge-stabilized Newton-Raphson algorithm for its estimates, which are ML. The information I read about the glm function in R is that the results should be equivalent to ML. What can I do to change my estimation procedure in R so that I get the equivalent coefficents and standard error estimates that were produced in SAS?

To update, thanks to Spacedman's answer, I used weights because the data are from individuals in a dietary survey, and REPLICATE_VAR is a balanced repeated replication weight, that is an integer (and quite large, in the order of 1000s or 10000s). The website that describes the weight is here. I don't know why the FREQ rather than the WEIGHT command was used in SAS. I will now test by expanding the number of observations using REPLICATE_VAR and rerunning the analysis.

Thanks to Ben's answer below, the code I am using now is:

parmsg2 <- coef(summary(glm(boxcoxxy ~ AGEGRP4 + AGEGRP5 + AGEGRP6 + AGEGRP7 + AGEGRP8 + RACE1 + RACE3 
+ WEEKEND + SEQ , data=data0, family=gaussian, weights = REPLICATE_VAR)))
#clean up the standard errors
parmsg2[,"Std. Error"] <- parmsg2[,"Std. Error"]/sqrt(mean(data0$REPLICATE_VAR)) 
parmsg2[,"t value"] <- parmsg2[,"Estimate"]/parmsg2[,"Std. Error"] 
#note: using the t-distribution for p-values, correct the t-values
allsummary <- summary.glm(glm(boxcoxxy ~ AGEGRP4 + AGEGRP5 + AGEGRP6 + AGEGRP7 + AGEGRP8 + RACE1 +
RACE3 + WEEKEND + SEQ , data=data0, family=gaussian, weights = REPLICATE_VAR))
parmsg2[,"Pr(>|t|)"] <- 2*pt(-abs(parmsg2[,"t value"]),df=allsummary$df.resid)

695

asked Nov 27 '11 22:11

Michelle

2 Answers

FREQ in SAS is not the same as weights in R's glm. In SAS, its the number of occurrences of that event. For R, its "that each response y_i is the mean of w_i unit-weight observations". These two things are not the same.

If you want R to give the same output as SAS (can't think why) then you probably need to repeat each row in your data frame 'weight' number of times.

Here, data is 10 rows with all weights=2, and data2 is 20 rows (2 copies of each row of data) with all weights=1:

> summary(glm(y~x,data=data2,weights=weights))$coef
              Estimate Std. Error   t value   Pr(>|t|)
(Intercept) 0.32859847 0.13413683 2.4497259 0.02475748
x           0.01540002 0.02161811 0.7123667 0.48537003
> summary(glm(y~x,data=data,weights=weights))$coef
              Estimate Std. Error   t value  Pr(>|t|)
(Intercept) 0.32859847 0.20120525 1.6331506 0.1410799
x           0.01540002 0.03242716 0.4749111 0.6475449

Handwaving a bit, N observations of the same value have less fuzziness than saying this observation is the mean of N observations, so the SE with the repeated observations will have a smaller SE than the mean one.

103

answered Nov 15 '22 21:11

Spacedman

edit: reading the SAS documentation for FREQ and your responses above and below, here's what I think you should try: use weights=REPLICATE_VAR in the glm statement to adjust the relative weighting of the groups (the equality of coefficients you found above suggests that this is the right way to go), then use N=sum(REPLICATE_VAR) in the adjustment suggested below (I also think you could be using lm instead of glm for this problem ... it won't make much difference but should be a tiny bit faster and more robust.) Something like:

s <- coef(summary(lm(y~x,data=data2, weights=REPLICATE_VAR)))
s[,"Std. Error"] <- s[,"Std. Error"]/sqrt(sum(data2$REPLICATE_VAR))
s[,"t value"] <- s[,"Estimate"]/s[,"Std. Error"]
s[,"Pr(>|t|)"] <- 2*pt(abs(s[,"t value"]),df=g$df.resid)

answered Nov 15 '22 21:11

Ben Bolker

Related questions
                            
                                What is setReplaceMethod() and how does it work?
                            
                                Strange formatting of legend in ggplotly in R
                            
                                Calculate distance longitude latitude of multiple in dataframe R
                            
                                Using flextable in r markdown loop not producing tables
                            
                                ggplot dotplot: What is the proper use of geom_dotplot?
                            
                                dplyr: mutate_at + coalesce: dynamic names of columns
                            
                                Fastest way to read large Excel xlsx files? To parallelize or not?
                            
                                Replace a sequence of values by group depending on preceeding values
                            
                                How to keep linetype spacing constant despite line size
                            
                                R: Which is the optimal way to compute functions over time with 3D arrays (latitude, longitude, and time)?
                            
                                Greek letters in ggplot strip text
                            
                                What are good example R packages written using RUnit or roxygen?
                            
                                Clear R Console programmatically [duplicate]
                            
                                Is the FoldLeft function available in R?
                            
                                How does one install 'rj' in StatET plugin for Eclipse?
                            
                                XPath within R using XML package
                            
                                easiest way to write a title page to pdf without Sweave
                            
                                Combine two data frames and remove duplicate columns
                            
                                Methods for doing heatmaps, level / contour plots, and hexagonal binning
                            
                                3D Plotting in R - Using a 4th dimension of color

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With