I want to run feature selection on data with several categorical variables. I have used get_dummies in pandas to generate the sparse matrices for these categorical variables. My question is: how does sklearn know that one specific sparse matrix actually belongs to a single feature, so that it selects or drops its columns all together? For example, I have a variable called city with three levels: New York, Chicago and Boston. Its sparse matrix looks like:
[1,0,0]
[0,1,0]
[0,0,1]
How can I tell sklearn that these three "columns" actually belong to one feature, city, so that it won't end up choosing New York and deleting Chicago and Boston?
Thank you so much!
You can't. The feature selection routines in scikit-learn will consider the dummy variables independently of each other. This means they can "trim" the domains of categorical variables down to the values that matter for prediction.
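To illustrate, here is a minimal sketch of that behaviour. The toy data, the label column, and the choice of SelectKBest with chi2 are assumptions made for the example, not something taken from the question; any other scikit-learn selector would treat the dummy columns the same way.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: only the "New York" level is associated with label 1.
df = pd.DataFrame({
    "city": ["New York", "Chicago", "Boston", "New York",
             "Chicago", "Boston", "New York", "Boston"],
    "label": [1, 0, 0, 1, 0, 0, 1, 0],
})

# One dummy column per level: city_Boston, city_Chicago, city_New York.
X = pd.get_dummies(df[["city"]], dtype=int)
y = df["label"]

# The selector scores each dummy column on its own; it has no notion
# that the three columns came from the same original variable.
selector = SelectKBest(chi2, k=1).fit(X, y)
selected = list(X.columns[selector.get_support()])
print(selected)
```

On this toy data the selector keeps a single level of city and drops the other two, which is the "trimming" described above. If you need to know which original variable a kept column came from, the column prefix that get_dummies adds (here "city_") can be used to map it back.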