I am using scikit-learn to train a classification model. I have both discrete and continuous features in my training data. I want to do feature selection using maximum mutual information. If I have vectors x and labels y, and the first three feature values are discrete, I can get the MMI values like so:
    mutual_info_classif(x, y, discrete_features=[0, 1, 2])
Now I'd like to use the same mutual information selection in a pipeline. I'd like to do something like this:
    SelectKBest(score_func=mutual_info_classif).fit(x, y)
but there's no way to pass the discrete features mask to SelectKBest. Is there some syntax to do this that I'm overlooking, or do I have to write my own score function wrapper?
Unfortunately, I could not find this functionality in SelectKBest. What we can do easily, though, is extend SelectKBest with a custom class that overrides the fit() method.
This is the current fit() method of SelectKBest (taken from the source on GitHub):
    # No provision for extra parameters here
    def fit(self, X, y):
        X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
        ....
        ....
        # Here only X and y are passed to the scoring function
        score_func_ret = self.score_func(X, y)
        ....
        ....
        self.scores_ = np.asarray(self.scores_)
        return self
Now we will define our new class SelectKBestCustom with the changed fit(). I have copied everything from the above source, changing only two lines (marked with comments):
    import numpy as np
    from sklearn.feature_selection import SelectKBest
    from sklearn.utils import check_X_y

    class SelectKBestCustom(SelectKBest):
        # Changed here: fit() accepts an extra discrete_features argument
        def fit(self, X, y, discrete_features='auto'):
            X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
            if not callable(self.score_func):
                raise TypeError("The score function should be a callable, %s (%s) "
                                "was passed."
                                % (self.score_func, type(self.score_func)))
            self._check_params(X, y)
            # Changed here also: forward discrete_features by keyword,
            # since newer versions of mutual_info_classif do not accept it positionally
            score_func_ret = self.score_func(X, y, discrete_features=discrete_features)
            if isinstance(score_func_ret, (list, tuple)):
                self.scores_, self.pvalues_ = score_func_ret
                self.pvalues_ = np.asarray(self.pvalues_)
            else:
                self.scores_ = score_func_ret
                self.pvalues_ = None
            self.scores_ = np.asarray(self.scores_)
            return self
This can be called simply like:
    clf = SelectKBestCustom(mutual_info_classif, k=2)
    clf.fit(X, y, discrete_features=[0, 1, 2])
Edit:
The above solution is also useful in pipelines, and the discrete_features parameter can be assigned different values when calling fit().
Another solution (less preferable):
Still, if you only need SelectKBest to work with mutual_info_classif temporarily (just to analyse the results), you can also write a custom function that calls mutual_info_classif internally with hard-coded discrete_features. Something along the lines of:
    def mutual_info_classif_custom(X, y):
        # To change discrete_features, you need to redefine the function each time,
        # because once the function is supplied to SelectKBest it can't be changed
        discrete_features = [0, 1, 2]
        # Pass by keyword: newer scikit-learn versions reject it as a positional argument
        return mutual_info_classif(X, y, discrete_features=discrete_features)
Usage of the above function:
    selector = SelectKBest(mutual_info_classif_custom).fit(X, y)
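A variation on the wrapper idea that avoids redefining a function each time is to bind the argument with functools.partial from the standard library. The sketch below is only an illustration under some assumptions: the toy data, the step names "select" and "model", and the choice of LogisticRegression are all made up for the example and are not from the question.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): columns 0-2 are discrete, columns 3-4 are continuous.
rng = np.random.default_rng(0)
X = np.hstack([
    rng.integers(0, 3, size=(200, 3)).astype(float),
    rng.normal(size=(200, 2)),
])
y = (X[:, 0] + X[:, 3] > 1).astype(int)

# Bind discrete_features up front, so SelectKBest can still call score_func(X, y).
score_func = partial(
    mutual_info_classif, discrete_features=[0, 1, 2], random_state=0
)

pipe = Pipeline([
    ("select", SelectKBest(score_func=score_func, k=2)),
    ("model", LogisticRegression()),
])
pipe.fit(X, y)

scores = pipe.named_steps["select"].scores_          # one MI score per feature
selected = pipe.named_steps["select"].get_support()  # boolean mask of kept features
```

The trade-off is the same as with the hard-coded wrapper: the mask is fixed when the selector is constructed, so a different mask means building a new partial, but that is a one-line change rather than a new function definition.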