Suppose I have a Pandas DataFrame like the one below, and I'm encoding categorical_1 for training in scikit-learn:
import pandas as pd

data = {'numeric_1': [12.1, 3.2, 5.5, 6.8, 9.9],
        'categorical_1': ['A', 'B', 'C', 'B', 'B']}
frame = pd.DataFrame(data)
dummy_values = pd.get_dummies(data['categorical_1'])
The values for 'categorical_1' are A, B, or C, so I end up with 3 columns in dummy_values. However, categorical_1 can in reality take on the values A, B, C, D, or E, so there is no column for D or E.
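For concreteness, dummy_values from the snippet above looks roughly like this (in recent pandas versions the 0/1 entries may display as booleans instead):

   A  B  C
0  1  0  0
1  0  1  0
2  0  0  1
3  0  1  0
4  0  1  0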
In R I would specify levels when factoring that column - is there a corresponding way to do this with Pandas or would I need to handle that manually?
In my mind this is necessary to account for test data with a value for that column outside of the values used in the training set, but being a novice in machine learning, perhaps that is not necessary so I'm open to a different way to approach this.
First, if you want pandas to produce dummy columns for more values, simply add them to the list sent to the get_dummies method:
data = {'numeric_1':[12.1, 3.2, 5.5, 6.8, 9.9],
'categorical_1':['A', 'B', 'C', 'B', 'B']}
frame = pd.DataFrame(data)
dummy_values = pd.get_dummies(data['categorical_1'] + ['D','E'])
This works because in Python the + operator on lists is a concatenation operation, so
['A','B','C','B','B'] + ['D','E']
results in
['A', 'B', 'C', 'B', 'B', 'D', 'E']
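One practical side effect (my own note, not part of the original answer): the concatenated list has seven elements, so the resulting dummy frame has seven rows along with the five columns A through E. A minimal sketch of trimming it back to the original rows:

dummy_values = pd.get_dummies(data['categorical_1'] + ['D', 'E'])
# Drop the two padding rows that correspond to the appended 'D' and 'E',
# keeping only the rows that belong to the real data in frame.
dummy_values = dummy_values.iloc[:len(frame)]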
"In my mind this is necessary to account for test data with a value for that column outside of the values used in the training set, but being a novice in machine learning, perhaps that is not necessary so I'm open to a different way to approach this."
From the machine learning perspective, it is quite redundant. This column is categorical, so the value 'D' means nothing to a model that has never seen it before. If you are encoding the features in a unary (one-hot) fashion, which I assume since you create a column for each value, it is enough to represent the 'D' and 'E' values as

A B C
0 0 0

(I assume that you represent the value 'B' as 0 1 0, 'C' as 0 0 1, etc.), because if there were no such values in the training set, then during testing no model will distinguish between being given the value 'D' or 'Elephant'.
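As a concrete sketch of this point (my own illustration, not part of the original answer): if the dummy columns are built from the training data and the test dummies are then reindexed to those same columns, any unseen value such as 'D' simply becomes an all-zero row:

train_dummies = pd.get_dummies(frame['categorical_1'], dtype=int)   # columns A, B, C
test = pd.Series(['B', 'D'])                                        # 'D' never appeared in training
test_dummies = pd.get_dummies(test, dtype=int).reindex(columns=train_dummies.columns, fill_value=0)
# the row for 'D' comes out as A=0, B=0, C=0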
The only reason for such an action would be the assumption that in the future you want to add data with 'D' values and simply do not want to modify the code; in that case it is reasonable to do it now, even though it makes training a bit more complex (you add a dimension that, for now, carries no information at all), but that seems a small problem.
If you are not going to encode it in the unary format, but rather want to use these values as a single categorical feature, then you do not need to create these "dummies" at all; instead, use a model that can work with such values directly, such as Naive Bayes, which can be trained with Laplace smoothing to handle values that do not occur in the training data.
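To illustrate the Laplace-smoothing idea with a hand-rolled sketch (my own example, not from the original answer; the helper name laplace_probs is made up):

from collections import Counter

def laplace_probs(observed, all_values):
    # Add-one (Laplace) smoothing: every possible value gets a pseudo-count of 1,
    # so categories absent from the training data still get a non-zero probability.
    counts = Counter(observed)
    total = len(observed) + len(all_values)
    return {value: (counts[value] + 1) / total for value in all_values}

probs = laplace_probs(['A', 'B', 'C', 'B', 'B'], ['A', 'B', 'C', 'D', 'E'])
# probs['B'] == (3 + 1) / (5 + 5) == 0.4, while the unseen probs['D'] == 1 / 10 == 0.1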