Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Linear model with categorical variables in R

Tags:

r

lm

factors

I am trying to fit a lineal model with some categorical variables

model <- lm(price ~ carat+cut+color+clarity)
summary(model)

The answer is:

Call:
lm(formula = price ~ carat + cut + color + clarity)

Residuals:
     Min       1Q   Median       3Q      Max 
-11495.7   -688.5   -204.1    458.2   9305.3 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3696.818     47.948 -77.100  < 2e-16 ***
carat        8843.877     40.885 216.311  < 2e-16 ***
cut.L         755.474     68.378  11.049  < 2e-16 ***
cut.Q        -349.587     60.432  -5.785 7.74e-09 ***
cut.C         200.008     52.260   3.827 0.000131 ***
cut^4          12.748     42.642   0.299 0.764994    
color.L      1905.109     61.050  31.206  < 2e-16 ***
color.Q      -675.265     56.056 -12.046  < 2e-16 ***
color.C       197.903     51.932   3.811 0.000140 ***
color^4        71.054     46.940   1.514 0.130165    
color^5         2.867     44.586   0.064 0.948729    
color^6        50.531     40.771   1.239 0.215268    
clarity.L    4045.728    108.363  37.335  < 2e-16 ***
clarity.Q   -1545.178    102.668 -15.050  < 2e-16 ***
clarity.C     999.911     88.301  11.324  < 2e-16 ***
clarity^4    -665.130     66.212 -10.045  < 2e-16 ***
clarity^5     920.987     55.012  16.742  < 2e-16 ***
clarity^6    -712.168     52.346 -13.605  < 2e-16 ***
clarity^7    1008.604     45.842  22.002  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1167 on 4639 degrees of freedom
Multiple R-squared:  0.9162,    Adjusted R-squared:  0.9159 
F-statistic:  2817 on 18 and 4639 DF,  p-value: < 2.2e-16

But I don't understand why the answers are with ".L,.Q,.C,^4, ...", something is wrong but I don't know what is wrong, I already tried with the function factor for each variable.

like image 740
Rashid Dergal Rufeil Avatar asked Dec 09 '22 03:12

Rashid Dergal Rufeil


1 Answers

Since @Roland didn't post what he he thought would be a better approach (and I kind of agreed with him), I needed to educate myself on how a real statistician would do this in R. I eventually found the right coding advice on SO in a post by @SvenHohenstein: How to properly set contrasts in R The reason I like to use as.numeric is that I know how to interpret the coefficients. The coefficient is the 'effect' (remembering that the work 'effect' does not imply causation) of a one level difference in level on the LHS-outcome or dependent variable. Looking at my first answer which at the moment is above this one, you see values around 1000 for the coefficients of cut2-5 and no value for cut1. The contribution of the "value" for cut==1 is buried inside the '(Intercept)'. The estimates look like:

> cbind( levels(diamonds$cut), c(coef(model.cut)[grep('Intercept|cut', names(coef(model.cut)))] ))
            [,1]        [,2]               
(Intercept) "Fair"      "-5189.46034442502"
cut2        "Good"      "909.432743872746" 
cut3        "Very Good" "1129.51839934007" 
cut4        "Premium"   "1156.98898349819" 
cut5        "Ideal"     "1264.12800574865" 

You could plot the unadjusted means, but the unadjusted values don't really make sense, (thus emphasizing the need for regression analyses):

> with(diamonds, tapply(price, cut, mean))
     Fair      Good Very Good   Premium     Ideal 
 4358.758  3928.864  3981.760  4584.258  3457.542 

So look at effect of cut within quintiles of carat:

> with(diamonds, tapply(price, list(cut, cut2(carat, g=5) ), mean))
          [0.20,0.36) [0.36,0.54) [0.54,0.91) [0.91,1.14) [1.14,5.01]
Fair         802.4528    1193.162    2336.543    4001.972    8682.351
Good         574.7482    1101.406    2701.412    4872.072    9788.294
Very Good    597.9258    1151.537    2727.251    5464.223   10158.057
Premium      717.1096    1149.550    2537.446    5214.787   10131.999
Ideal        739.8972    1254.229    2624.180    6050.358   10317.725

So an effect of ... what? maybe an average of 800 across the full range of values of 'cut' for a two-way analysis?

contrasts(diamonds$cut, how.many=1) <- poly(1:5)
> model.cut2 <- lm(price ~ carat+cut, diamonds)
> model.cut2

Call:
lm(formula = price ~ carat + cut, data = diamonds)

Coefficients:
(Intercept)        carat         cut1  
    -2555.1       7838.5        815.8  

> contrasts(diamonds$cut)
                   1
Fair      -0.6324555
Good      -0.3162278
Very Good  0.0000000
Premium    0.3162278
Ideal      0.6324555

The average difference in estimated price holding carat constant for Fair versus Ideal would be ( -0.6324555 -0.6324555)*815.8 or a price difference of minus 1031.91 (dollars or euros.... whatever the units of the price variable)

I've decide to remove a bunch of other stuff I was going to put in here, because I think this adequately demonstrates my essential point that one needs to understand the underlying coding in order to properly interpret and communicate the magnitude of "effects". The coefficients alone are not meaningful. The linear contrasts from poly create an effect-coefficient that is essentially for a "full" range difference. One needs to do comparisons using both the contrast matrix values and the estimated coefficients if using R poly(). The range of contrasts are typically around 1 and linear contrasts are centered on 0.

like image 93
IRTFM Avatar answered Dec 10 '22 17:12

IRTFM