 

Linear regression using categories as features

I'm trying to put together a linear regression model, but some of my features are not numerical, e.g. "Car Colour", whereas others are, e.g. "Engine Size". I'm unsure how to represent the non-numerical ones as input features. The only way I could think of would be to represent each colour with a different value, e.g. (red = 1, blue = 2, green = 3...); however, this doesn't seem acceptable, as it implies that green is "better" than red.

Can anybody help? I'm implementing this in Java, so I'd appreciate an algorithm expressed in that language or in a language-independent form.

raven-king asked Jul 29 '12 13:07



1 Answer

One way to do this is to use dummy coding; another technique is effect coding.

Please refer to this article for more detail; I think the author explains it better than I can here.

Coding Categorical Variables in Regression Models: Dummy and Effect Coding by Resmi Gupta

I guess this solution would fall into your language independent category ;)

To encode the car color (I'm assuming it can take only 3 values: red, blue, green), you can use the following scheme:

Color   Dummy_Var_One   Dummy_Var_Two
Red     1               0
Blue    0               1
Green   0               0

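In the above table, Green becomes the reference level: it is the category for which every dummy variable is 0. In general, if your color takes n values, you need to include only n-1 dummy variables, because in a model with an intercept the n-th one would be fully determined by the others.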

An implementation in Java can be found in the Weka filter NominalToBinary, though note that it creates n variables for n categories rather than n-1.
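If you just want to see what the coding looks like in plain Java, here is a minimal sketch of the table above. The class and method names (`CategoricalCoding`, `dummyCode`, `effectCode`) are made up for illustration, and I'm assuming the last entry in the list of levels is the reference level. The effect-coding variant is the same except that the reference level gets -1 in every column instead of 0.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of any library.
class CategoricalCoding {

    // Dummy coding: returns n-1 values for n levels.
    // Assumption: the LAST entry of `levels` is the reference level (all zeros).
    static double[] dummyCode(String value, List<String> levels) {
        int index = levels.indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown category: " + value);
        }
        double[] row = new double[levels.size() - 1];
        if (index < row.length) {
            row[index] = 1.0; // non-reference level: a single 1 in its own column
        }
        return row;           // reference level: stays all zeros
    }

    // Effect coding: same as dummy coding, but the reference level is -1 in every column.
    static double[] effectCode(String value, List<String> levels) {
        double[] row = dummyCode(value, levels);
        if (levels.indexOf(value) == levels.size() - 1) {
            Arrays.fill(row, -1.0);
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> colors = Arrays.asList("red", "blue", "green"); // green = reference
        double engineSize = 1.6; // example numerical feature

        // Build one feature row: [engineSize, Dummy_Var_One, Dummy_Var_Two]
        double[] dummies = dummyCode("blue", colors);
        double[] features = new double[1 + dummies.length];
        features[0] = engineSize;
        System.arraycopy(dummies, 0, features, 1, dummies.length);

        System.out.println(Arrays.toString(features)); // prints [1.6, 0.0, 1.0]
    }
}
```

The resulting array can be fed to whatever regression routine you are using alongside the numerical features; the fitted coefficient of each dummy variable is then read relative to the reference level.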

darshan answered Sep 24 '22 19:09
