How can sklearn select categorical features during feature selection?

I want to run feature selection on data with several categorical variables. I have used get_dummies in pandas to generate the indicator (dummy) matrices for these categorical variables. My question is: how does sklearn know that one specific indicator matrix actually belongs to a single feature, so that it selects or drops all of its columns together? For example, I have a variable called city with three levels: New York, Chicago and Boston. The indicator matrix looks like:

[1,0,0]
[0,1,0]
[0,0,1]

How can I tell sklearn that these three "columns" actually belong to one feature, city, so that it won't end up choosing New York and dropping Chicago and Boston?

Thank you so much!

MYjx asked Jul 29 '14 16:07


1 Answer

You can't. The feature selection routines in scikit-learn will consider the dummy variables independently of each other. This means they can "trim" the domains of categorical variables down to the values that matter for prediction.
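As a rough illustration of that behaviour, here is a minimal sketch. The toy data, the target definition, and the choice of SelectKBest with the chi2 score are assumptions made for this example, not something taken from the question. Each dummy column produced by get_dummies is scored separately, so a single level of city can be kept while the others are dropped.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Toy data (made up for illustration): only "New York" predicts the target.
cities = ["New York", "Chicago", "Boston"] * 4
df = pd.DataFrame({"city": cities})
y = (df["city"] == "New York").astype(int)

# One dummy column per level: city_Boston, city_Chicago, city_New York.
X = pd.get_dummies(df["city"], prefix="city", dtype=int)

# Each dummy column is scored on its own; nothing ties them back to "city".
selector = SelectKBest(chi2, k=1).fit(X, y)

# Names of the dummy columns that survived selection.
selected = list(X.columns[selector.get_support()])
print(selected)
```

With this toy data the selector keeps only the city_New York column: the domain of city has been trimmed to the one level that matters for the target, which is the effect described above.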

Fred Foo answered Oct 16 '22 16:10