How do I SelectKBest using mutual information from a mixture of discrete and continuous features?

I am using scikit-learn to train a classification model. I have both discrete and continuous features in my training data. I want to do feature selection using maximum mutual information. If I have vectors x and labels y, and the first three feature values are discrete, I can get the MMI values like so:

mutual_info_classif(x, y, discrete_features=[0, 1, 2])
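For context, here is a minimal runnable sketch of that call on toy data (the column layout is invented for illustration; only the `mutual_info_classif` call itself is from the question):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Toy data: columns 0-2 are discrete codes, column 3 is continuous
rng = np.random.RandomState(0)
x = np.column_stack([
    rng.randint(0, 3, size=100),   # discrete
    rng.randint(0, 2, size=100),   # discrete
    rng.randint(0, 4, size=100),   # discrete
    rng.rand(100),                 # continuous
])
y = rng.randint(0, 2, size=100)

# One non-negative mutual-information estimate per feature
mi = mutual_info_classif(x, y, discrete_features=[0, 1, 2])
print(mi.shape)  # (4,)
```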

Now I'd like to use the same mutual information selection in a pipeline. I'd like to do something like this:

SelectKBest(score_func=mutual_info_classif).fit(x, y)

but there's no way to pass the discrete features mask to SelectKBest. Is there some syntax to do this that I'm overlooking, or do I have to write my own score function wrapper?

asked Mar 09 '23 by W.P. McNeill


1 Answer

Unfortunately, SelectKBest does not expose this functionality. What we can easily do, though, is subclass SelectKBest and override the fit() method that gets called.

This is the current fit() method of SelectKBest (taken from the source on GitHub):

# No provision for extra parameters here
def fit(self, X, y):
    X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)

    ....
    ....

    # Here only the X, y are passed to scoring function
    score_func_ret = self.score_func(X, y)

    ....        
    ....

    self.scores_ = np.asarray(self.scores_)

    return self

Now we will define our new class SelectKBestCustom with the changed fit(). I have copied everything from the above source, changing only two lines (marked with comments):

import numpy as np

from sklearn.feature_selection import SelectKBest
from sklearn.utils import check_X_y

class SelectKBestCustom(SelectKBest):

    # Changed here
    def fit(self, X, y, discrete_features='auto'):
        X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)

        if not callable(self.score_func):
            raise TypeError("The score function should be a callable, %s (%s) "
                        "was passed."
                        % (self.score_func, type(self.score_func)))

        self._check_params(X, y)

        # Changed here also: discrete_features is forwarded to the score
        # function (by keyword, since recent scikit-learn versions make it
        # keyword-only in mutual_info_classif)
        score_func_ret = self.score_func(X, y, discrete_features=discrete_features)
        if isinstance(score_func_ret, (list, tuple)):
            self.scores_, self.pvalues_ = score_func_ret
            self.pvalues_ = np.asarray(self.pvalues_)
        else:
            self.scores_ = score_func_ret
            self.pvalues_ = None

        self.scores_ = np.asarray(self.scores_)
        return self

This can be called simply like:

clf = SelectKBestCustom(mutual_info_classif, k=2)
clf.fit(X, y, discrete_features=[0, 1, 2])

Edit: The above solution can also be used in pipelines, and the discrete_features parameter can be given different values each time fit() is called.
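To make the pipeline claim concrete, here is a hedged, self-contained sketch. The step name 'select', the LogisticRegression estimator, and the toy data are my own choices, not from the answer; note that Pipeline.fit() routes keyword arguments to steps via the "step__parameter" naming convention, and that _check_params is a private scikit-learn method:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.utils import check_X_y

class SelectKBestCustom(SelectKBest):
    """SelectKBest whose fit() forwards discrete_features (condensed from above)."""
    def fit(self, X, y, discrete_features='auto'):
        X, y = check_X_y(X, y, accept_sparse=['csr', 'csc'], multi_output=True)
        self._check_params(X, y)  # private API: validates k against n_features
        # Keyword call: discrete_features is keyword-only in recent scikit-learn
        self.scores_ = np.asarray(
            self.score_func(X, y, discrete_features=discrete_features))
        self.pvalues_ = None
        return self

# Toy data: columns 0-2 discrete, column 3 continuous (invented for illustration)
rng = np.random.RandomState(0)
X = np.column_stack([rng.randint(0, 3, size=200),
                     rng.randint(0, 2, size=200),
                     rng.randint(0, 4, size=200),
                     rng.rand(200)])
y = rng.randint(0, 2, size=200)

pipe = Pipeline([
    ('select', SelectKBestCustom(mutual_info_classif, k=2)),
    ('clf', LogisticRegression()),
])
# "select__discrete_features" is routed to SelectKBestCustom.fit()
pipe.fit(X, y, select__discrete_features=[0, 1, 2])
print(pipe.named_steps['select'].scores_.shape)  # (4,)
```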

Another solution (less preferable): if you just need SelectKBest to work with mutual_info_classif temporarily (say, only to analyse the results), you can write a custom function that calls mutual_info_classif internally with hard-coded discrete_features. Something along these lines:

def mutual_info_classif_custom(X, y):
    # To change discrete_features you need to redefine the function each time,
    # because once the function is supplied to SelectKBest it can't be changed.
    discrete_features = [0, 1, 2]

    return mutual_info_classif(X, y, discrete_features=discrete_features)

Usage of the above function:

selector = SelectKBest(mutual_info_classif_custom).fit(X, y)
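A variation on this that avoids redefining the function each time (not in the original answer, but a common idiom) is functools.partial, which bakes discrete_features into the score function while still presenting the (X, y) signature SelectKBest expects; the toy data below is invented for illustration:

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Toy data: columns 0-2 discrete, column 3 continuous (invented for illustration)
rng = np.random.RandomState(0)
X = np.column_stack([rng.randint(0, 3, size=200),
                     rng.randint(0, 2, size=200),
                     rng.randint(0, 4, size=200),
                     rng.rand(200)])
y = rng.randint(0, 2, size=200)

# partial() fixes discrete_features; SelectKBest still sees an (X, y) callable
score_func = partial(mutual_info_classif, discrete_features=[0, 1, 2])
selector = SelectKBest(score_func, k=2).fit(X, y)
print(selector.transform(X).shape)  # (200, 2)
```

Unlike the hard-coded wrapper, you can build a new partial with different indices without touching the function definition.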
answered May 12 '23 by Vivek Kumar