I am trying to fit a linear model with some categorical variables:
model <- lm(price ~ carat+cut+color+clarity)
summary(model)
The output is:
Call:
lm(formula = price ~ carat + cut + color + clarity)
Residuals:
Min 1Q Median 3Q Max
-11495.7 -688.5 -204.1 458.2 9305.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3696.818 47.948 -77.100 < 2e-16 ***
carat 8843.877 40.885 216.311 < 2e-16 ***
cut.L 755.474 68.378 11.049 < 2e-16 ***
cut.Q -349.587 60.432 -5.785 7.74e-09 ***
cut.C 200.008 52.260 3.827 0.000131 ***
cut^4 12.748 42.642 0.299 0.764994
color.L 1905.109 61.050 31.206 < 2e-16 ***
color.Q -675.265 56.056 -12.046 < 2e-16 ***
color.C 197.903 51.932 3.811 0.000140 ***
color^4 71.054 46.940 1.514 0.130165
color^5 2.867 44.586 0.064 0.948729
color^6 50.531 40.771 1.239 0.215268
clarity.L 4045.728 108.363 37.335 < 2e-16 ***
clarity.Q -1545.178 102.668 -15.050 < 2e-16 ***
clarity.C 999.911 88.301 11.324 < 2e-16 ***
clarity^4 -665.130 66.212 -10.045 < 2e-16 ***
clarity^5 920.987 55.012 16.742 < 2e-16 ***
clarity^6 -712.168 52.346 -13.605 < 2e-16 ***
clarity^7 1008.604 45.842 22.002 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1167 on 4639 degrees of freedom
Multiple R-squared: 0.9162, Adjusted R-squared: 0.9159
F-statistic: 2817 on 18 and 4639 DF, p-value: < 2.2e-16
But I don't understand why the coefficients are labelled with ".L, .Q, .C, ^4, ...". Something seems wrong, but I don't know what. I already tried applying factor() to each variable.
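(For context: the .L, .Q, .C, ^4 labels appear because cut, color and clarity in ggplot2's diamonds data are ordered factors, for which lm() uses polynomial contrasts by default. A minimal check, assuming the ggplot2 diamonds data is what's being modelled:)
# cut, color and clarity are *ordered* factors, so lm() applies polynomial
# contrasts (.L = linear, .Q = quadratic, .C = cubic, ^4, ...) instead of
# the usual dummy coding.
library(ggplot2)
class(diamonds$cut)        # "ordered" "factor"
contrasts(diamonds$cut)    # the contr.poly() columns behind .L, .Q, .C, ^4
# Dropping the ordering gives the familiar treatment (dummy) coding instead:
diamonds$cut <- factor(diamonds$cut, ordered = FALSE)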
Since @Roland didn't post what he thought would be a better approach (and I kind of agreed with him), I needed to educate myself on how a real statistician would do this in R. I eventually found the right coding advice on SO in a post by @SvenHohenstein: How to properly set contrasts in R. The reason I like to use as.numeric is that I know how to interpret the coefficients. The coefficient is the 'effect' (remembering that the word 'effect' does not imply causation) of a one-level difference in the predictor on the LHS outcome or dependent variable. Looking at my first answer, which at the moment is above this one, you see values around 1000 for the coefficients of cut2-5 and no value for cut1. The contribution of the "value" for cut == 1 is buried inside the '(Intercept)'. The estimates look like:
> cbind( levels(diamonds$cut), c(coef(model.cut)[grep('Intercept|cut', names(coef(model.cut)))] ))
[,1] [,2]
(Intercept) "Fair" "-5189.46034442502"
cut2 "Good" "909.432743872746"
cut3 "Very Good" "1129.51839934007"
cut4 "Premium" "1156.98898349819"
cut5 "Ideal" "1264.12800574865"
You could plot the unadjusted means, but the unadjusted values don't really make sense (thus emphasizing the need for regression analyses):
> with(diamonds, tapply(price, cut, mean))
Fair Good Very Good Premium Ideal
4358.758 3928.864 3981.760 4584.258 3457.542
So look at the effect of cut within quintiles of carat:
> library(Hmisc)   # cut2(), used below for quantile-based grouping, comes from Hmisc
> with(diamonds, tapply(price, list(cut, cut2(carat, g = 5)), mean))
[0.20,0.36) [0.36,0.54) [0.54,0.91) [0.91,1.14) [1.14,5.01]
Fair 802.4528 1193.162 2336.543 4001.972 8682.351
Good 574.7482 1101.406 2701.412 4872.072 9788.294
Very Good 597.9258 1151.537 2727.251 5464.223 10158.057
Premium 717.1096 1149.550 2537.446 5214.787 10131.999
Ideal 739.8972 1254.229 2624.180 6050.358 10317.725
So an effect of ... what? Maybe an average of 800 across the full range of values of 'cut' for a two-way analysis?
contrasts(diamonds$cut, how.many=1) <- poly(1:5)
> model.cut2 <- lm(price ~ carat+cut, diamonds)
> model.cut2
Call:
lm(formula = price ~ carat + cut, data = diamonds)
Coefficients:
(Intercept) carat cut1
-2555.1 7838.5 815.8
> contrasts(diamonds$cut)
1
Fair -0.6324555
Good -0.3162278
Very Good 0.0000000
Premium 0.3162278
Ideal 0.6324555
The average difference in estimated price, holding carat constant, for Fair versus Ideal would be (-0.6324555 - 0.6324555) * 815.8, i.e. a price difference of about -1031.9 (dollars or euros ... whatever the units of the price variable).
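A short sketch of that same comparison done in code, pulling the contrast values and the coefficient from the objects fitted above:
cc <- contrasts(diamonds$cut)                    # the single linear contrast column
diff_contrast <- cc["Fair", 1] - cc["Ideal", 1]  # -0.6324555 - 0.6324555
diff_contrast * coef(model.cut2)["cut1"]         # about -1031.9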
I've decided to remove a bunch of other stuff I was going to put in here, because I think this adequately demonstrates my essential point: one needs to understand the underlying coding in order to properly interpret and communicate the magnitude of "effects". The coefficients alone are not meaningful. The linear contrasts from poly() create an effect-coefficient that essentially corresponds to a "full"-range difference. One needs to do comparisons using both the contrast matrix values and the estimated coefficients if using R's poly(). The range of the contrasts is typically around 1, and linear contrasts are centered on 0.