I'm trying to put together a linear regression model, but some of my features are not numerical (e.g. "Car Colour"), whereas others are (e.g. "Engine Size"). For the non-numerical features I'm unsure how to represent them as inputs. The only way I could think of is to assign each colour a different value, e.g. (red = 1, blue = 2, green = 3, ...); however, this doesn't seem acceptable, as it implies an ordering in which green is "better" than red.
Can anybody help? I'm implementing this in Java, so I'd appreciate any algorithms expressed in that language, or language-independent ones.
One way to do this is to use dummy coding; another technique is effect coding.
Please refer to this article for more detail; I think the author has explained it better than I can here.
Coding Categorical Variables in Regression Models: Dummy and Effect Coding by Resmi Gupta
I guess this solution would fall into your language independent category ;)
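Since the question asked for Java, here is a minimal sketch of effect coding. It assumes the usual definition: like dummy coding it uses n-1 columns, but the reference level is coded as -1 in every column instead of 0. The class name EffectCoding, the method encode, and the choice of the last level as the reference are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class EffectCoding {

    /**
     * Encodes one categorical value as n-1 effect-coded columns.
     * Assumption: the last entry of levels is the reference level,
     * which is coded as -1 in every column.
     */
    static double[] encode(String value, List<String> levels) {
        int index = levels.indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown level: " + value);
        }
        double[] codes = new double[levels.size() - 1];
        if (index == levels.size() - 1) {
            Arrays.fill(codes, -1.0);   // reference level: all -1
        } else {
            codes[index] = 1.0;         // any other level: a single 1
        }
        return codes;
    }

    public static void main(String[] args) {
        List<String> colours = Arrays.asList("Red", "Blue", "Green");
        for (String colour : colours) {
            System.out.println(colour + " -> " + Arrays.toString(encode(colour, colours)));
        }
    }
}
```

With this coding, each coefficient is read as a deviation from the mean across all levels rather than from the reference level.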
To encode the car color (I'm assuming it can take only 3 values: red, blue, green), you can use two dummy variables as follows:
Color   Dummy_Var_One   Dummy_Var_Two
Red     1               0
Blue    0               1
Green   0               0
In the above table, Green becomes the reference level. In general, if your color takes n values, you will need to include n-1 dummy variables.
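As the question mentions Java, the scheme above could be sketched roughly like this. The class name DummyCoding and the methods encode and featureRow are hypothetical names, and treating the last level in the list as the reference is just one possible convention.

```java
import java.util.Arrays;
import java.util.List;

public class DummyCoding {

    /**
     * Encodes one categorical value as n-1 dummy variables.
     * Assumption: the last entry of levels is the reference level,
     * which is represented by all zeros.
     */
    static double[] encode(String value, List<String> levels) {
        int index = levels.indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown level: " + value);
        }
        double[] dummies = new double[levels.size() - 1];
        if (index < dummies.length) {
            dummies[index] = 1.0;   // non-reference level: a single 1
        }
        return dummies;             // reference level: stays all zeros
    }

    /** Builds one regression input row: the numeric feature followed by the dummies. */
    static double[] featureRow(double engineSize, String colour, List<String> levels) {
        double[] dummies = encode(colour, levels);
        double[] row = new double[1 + dummies.length];
        row[0] = engineSize;
        System.arraycopy(dummies, 0, row, 1, dummies.length);
        return row;
    }

    public static void main(String[] args) {
        List<String> colours = Arrays.asList("Red", "Blue", "Green");
        for (String colour : colours) {
            System.out.println(colour + " -> " + Arrays.toString(encode(colour, colours)));
        }
        // Prints:
        // Red -> [1.0, 0.0]
        // Blue -> [0.0, 1.0]
        // Green -> [0.0, 0.0]
    }
}
```

A row such as featureRow(2.0, "Blue", colours) then gives [2.0, 0.0, 1.0], which can be passed to whatever regression routine you are using alongside the other numeric features.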
An implementation in Java can be found in the Weka filter NominalToBinary, though this will create n variables for n categories.