I'm working with data that consists of a few dozen binary features about people which basically come down to "person has feature x" [True/False].
From what I can find online, categorical data should be one-hot encoded rather than assigned an arbitrary value per category, because you can't say "category 1 is less than category 2". So the solution is to create a dummy variable for each category:
Cat || dummy 1 | dummy 2 | dummy 3
----||---------|---------|--------
 1  ||    1    |    0    |    0
 2  ||    0    |    1    |    0
 3  ||    0    |    0    |    1
Now for binary features one can either use the variable directly (1 for True, 0 for False) or use two dummy variables ((1, 0) for True, (0, 1) for False). But I can't find any sources that show or explain which approach is best.
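To make the two options concrete, here is a small sketch in pandas (the has_feature_x column is just a made-up example):

```python
import pandas as pd

# Toy frame with a single made-up binary feature.
df = pd.DataFrame({"has_feature_x": [True, False, True]})

# Option 1: use the variable directly (1 for True, 0 for False).
direct = df["has_feature_x"].astype(int)

# Option 2: one dummy column per value, so True and False each get their own 0/1 column.
two_dummies = pd.get_dummies(df["has_feature_x"], prefix="has_feature_x", dtype=int)

print(direct)
print(two_dummies)
```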
I myself am conflicted: on the one hand, the dummy variables reduce the importance of each individual variable, and it has been shown that in at least some cases the accuracy of the model suffers (source). On the other hand, the two-dummy encoding can also represent missing data (as (0, 0)). Furthermore, is it even possible to say "False is less than True"?
I'm actually using a Random Forest in Python, and I know that tree-based classifiers such as Random Forests support categorical data, but the Sklearn package hasn't implemented this yet.
I wrote a small test on the Sklearn digits data set. This data set consists of 8-by-8 images of digits (0-9); each pixel has a value between 0 and 16, and a simple model can use this to learn to recognize the digits.
For my test I change the pixel values > 8 to True and <= 8 to False. The accuracy of course suffers compared to the original data, but when I apply one-hot encoding, i.e. changing True to (1, 0) and False to (0, 1), I can't find a significant difference compared to the plain binary encoding.
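A minimal sketch of that test (not my exact script; the forest size and the 5-fold cross-validation are just convenient defaults):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

# Binarize the pixels: > 8 becomes True (1), <= 8 becomes False (0).
X_bin = (X > 8).astype(int)

# One-hot version: each pixel becomes two columns, (1, 0) for True and (0, 1) for False.
X_onehot = np.hstack([X_bin, 1 - X_bin])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("binary :", cross_val_score(clf, X_bin, y, cv=5).mean())
print("one-hot:", cross_val_score(clf, X_onehot, y, cv=5).mean())
```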
An explanation of the recommended approach would be greatly appreciated!
Converting a binary variable that takes the values [0, 1] into a one-hot encoded form of [(0, 1), (1, 0)] is redundant and not recommended, for the following reasons (some of them are already mentioned in the comment above, but to expand on this):
It is redundant because the binary variable is already equivalent to a one-hot encoding with the last column dropped; that column makes no difference, since it can be inferred from the column that is given: if I give you [(0,), (1,)], you can fill in the complementary column [(1,), (0,)] yourself.
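If you want to run the data through an encoder anyway, both pandas and scikit-learn can drop the redundant column for you. A small sketch, using a made-up has_feature_x column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up binary feature, just for illustration.
df = pd.DataFrame({"has_feature_x": [True, False, True]})

# pandas: drop_first=True keeps a single dummy per variable.
print(pd.get_dummies(df, columns=["has_feature_x"], drop_first=True, dtype=int))

# scikit-learn: drop="if_binary" drops one column for two-category features only.
enc = OneHotEncoder(drop="if_binary")
print(enc.fit_transform(df[["has_feature_x"]]).toarray())
```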
Suppose you have more than one binary variable, say 4. If you convert them all into one-hot encoded form, the dimensionality increases from 4 to 8, which is not recommended for the following reasons (the sketch after this list illustrates both the doubling and the resulting correlation):
The Curse of Dimensionality: High-dimensional data can be troublesome. Many algorithms (e.g. clustering algorithms) use the Euclidean distance, which, due to its squared terms, is sensitive to noise. As the number of dimensions increases, data points spread too thin and the space becomes extremely sparse; the concept of a neighborhood loses its meaning, and approaches that rely on the relative contrast between distances of the data points become unreliable.
Time & Memory Complexity: Intuitively, increasing the number of features increases the algorithm's execution time and memory requirements. To name a few examples, algorithms that use the Covariance Matrix in their computations are affected, polynomial models end up with many more terms, and so on. In general, learning is usually faster with fewer features, especially if the extra features are redundant.
Multi-Collinearity: Since the last column in the one-hot encoded form of a binary variable is redundant and 100% correlated with the first column, this causes trouble for linear-regression-based algorithms. For example, the ordinary least squares estimates involve inverting a matrix; when many features are correlated, that matrix is close to singular, so the computed inverse may be numerically inaccurate. Also, linear models work by observing the change in the dependent variable y for a unit change in one independent variable while all other independent variables are held constant; when the independent variables are highly correlated, that interpretation breaks down (and there are further consequences of Multi-Collinearity), although some other algorithms, such as Decision Trees, are less sensitive to this.
Overfitting-prone: In general, too many features (whether correlated or not) may overfit your model and make it fail to generalize to new examples, as every data point in your dataset becomes fully identified by the given features (search for Andrew Ng's lectures; he explains this in detail).
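To make the dimensionality and correlation points concrete, here is a small sketch with four randomly generated binary features (the data is made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Four binary features for 1000 hypothetical people.
X = rng.integers(0, 2, size=(1000, 4))

# One-hot encoding every binary feature doubles the width: 4 -> 8 columns.
X_onehot = np.hstack([X, 1 - X])
print(X.shape, "->", X_onehot.shape)  # (1000, 4) -> (1000, 8)

# Each added column is perfectly (negatively) correlated with its original column,
# which is exactly the multicollinearity problem described above.
print(np.corrcoef(X_onehot[:, 0], X_onehot[:, 4])[0, 1])  # ~ -1.0
```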
In a nutshell, converting a binary variable into a one-hot encoded one is redundant and may lead to trouble that is needless and unsolicited. While correlated features may not always worsen your model, they will not always improve it either.