Linear regression using categories as features

Question

I'm trying to put together a linear regression model, but some of my features are not numerical (e.g. "Car Colour"), whereas others are (e.g. "Engine Size"). For the non-numerical features I'm unsure how to represent them as inputs. The only way I could think of is to give each colour a different value (e.g. red = 1, blue = 2, green = 3...), but this doesn't seem acceptable because it implies that green is "better" than red.

Can anybody help? I'm implementing this in Java, so I'd appreciate an algorithm expressed in that language or one that is language independent.

darshan · Accepted Answer

One way to do this is to use dummy coding; another technique is effect coding.

Please refer to this article for more detail. I think the author has explained it better than I can here.

Coding Categorical Variables in Regression Models: Dummy and Effect Coding by Resmi Gupta

I guess this solution would fall into your language independent category ;)

To encode the car color (I'm assuming it can take only 3 values: red, blue, green), you can use the following scheme:

Color   Dummy_Var_One   Dummy_Var_Two
Red     1               0
Blue    0               1
Green   0               0

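In the above table, Green becomes the reference level. In general, if your color takes n values, you need to include n-1 dummy variables. The coefficient fitted for each dummy variable is then the expected difference from the reference level with the other features held fixed, so no ordering among the colors is implied.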
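Since the question asks for Java, here is a minimal sketch of the dummy coding scheme described above. The class and method names (DummyCoding, encode, featureRow) are made up for illustration, and treating the last entry of the level list as the reference level is an assumption of this sketch, not a requirement of the technique.

```java
import java.util.Arrays;
import java.util.List;

final class DummyCoding {

    /**
     * Dummy-codes one categorical value into levels.size() - 1 indicator columns.
     * Assumption of this sketch: the LAST entry of levels is the reference
     * level and is therefore encoded as all zeros.
     */
    static double[] encode(String value, List<String> levels) {
        int index = levels.indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown level: " + value);
        }
        double[] row = new double[levels.size() - 1];
        if (index < row.length) {
            row[index] = 1.0; // non-reference level: switch on its own column
        }
        return row;           // reference level: every column stays 0
    }

    /** Builds one input row: the numeric feature followed by the dummy columns. */
    static double[] featureRow(double engineSize, String colour, List<String> levels) {
        double[] dummies = encode(colour, levels);
        double[] row = new double[1 + dummies.length];
        row[0] = engineSize;
        System.arraycopy(dummies, 0, row, 1, dummies.length);
        return row;
    }

    public static void main(String[] args) {
        // Green is listed last, so it acts as the reference level, as in the table.
        List<String> colours = Arrays.asList("red", "blue", "green");
        for (String colour : colours) {
            System.out.println(colour + " -> " + Arrays.toString(encode(colour, colours)));
        }
        System.out.println(Arrays.toString(featureRow(1.6, "blue", colours)));
    }
}
```

Each row produced by featureRow can be passed to whatever regression routine you use, as long as that routine adds its own intercept term. The intercept then absorbs the effect of the reference level.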

An implementation in Java can be found in the Weka filter NominalToBinary, though this will create n variables for n categories.

Tags:

java

algorithm

artificial-intelligence

machine-learning

raven-king
