
Is there a way to perform grid search hyper-parameter optimization on One-Class SVM

Is there a way to use GridSearchCV or any other built-in sklearn function to find the best hyper-parameters for the OneClassSVM classifier?

What I currently do is perform the search myself using a train/test split, like this:

The gamma and nu values are defined as:

import numpy as np

gammas = np.logspace(-9, 3, 13)
nus = np.linspace(0.01, 0.99, 99)

This code explores all hyper-parameter combinations and finds the best ones:

from sklearn.svm import OneClassSVM
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

clf = OneClassSVM()

results = []

# vectorizer, train_contents, test_contents and y_true (the test labels)
# are defined elsewhere.
train_x = vectorizer.fit_transform(train_contents)
test_x = vectorizer.transform(test_contents)

for gamma in gammas:
    for nu in nus:
        clf.set_params(gamma=gamma, nu=nu)

        clf.fit(train_x)

        y_pred = clf.predict(test_x)

        if 1. in y_pred:  # Check if at least 1 review is predicted to be in the class
            results.append(((gamma, nu), (accuracy_score(y_true, y_pred),
                                          precision_score(y_true, y_pred),
                                          recall_score(y_true, y_pred),
                                          f1_score(y_true, y_pred),
                                          roc_auc_score(y_true, y_pred))))

# Determine and print the best parameter settings and their performance
print_best_parameters(results, best_parameters(results))

Results are stored in a list of tuples of the form:

((gamma, nu), (accuracy_score, precision_score, recall_score, f1_score, roc_auc_score))

To find the best accuracy, f1, and roc_auc scores and the corresponding parameters, I wrote my own function:

best_parameters(results)
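
The question does not show best_parameters, so purely as an assumption for illustration, a minimal sketch of such a helper might pick the entry with the highest f1 score:

# Hypothetical sketch: the real best_parameters is not shown in the question.
def best_parameters(results):
    # Each entry is ((gamma, nu), (accuracy, precision, recall, f1, roc_auc));
    # index 3 of the metrics tuple is the f1 score.
    return max(results, key=lambda entry: entry[1][3])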

asked Jun 22 '17 by Yustx

1 Answer

I ran into the same problem and found this question while searching for a solution. I ended up with an approach that uses GridSearchCV and am leaving this answer for anyone else who comes across this question.

The cv parameter of the GridSearchCV class can take as its input an iterable yielding (train, test) splits as arrays of indices. You can generate splits that use only data from the positive class in the training folds, and the remaining data in the positive class plus all data in the negative class in the testing folds.

You can use sklearn.model_selection.KFold to make the splits:

from sklearn.model_selection import KFold

Suppose Xpos is an n×p numpy array of data for the positive class for the OneClassSVM, and Xneg is an m×p array of data for known anomalous examples.

You can first generate splits for Xpos using

splits = KFold(n_splits=5).split(Xpos)

This will construct a generator of tuples of the form (train, test), where train is a numpy array of ints containing the indices of the examples in a training fold and test is a numpy array containing the indices of the examples in a test fold.

You can then combine Xpos and Xneg into a single dataset using

X = np.concatenate([Xpos, Xneg], axis=0)

The OneClassSVM predicts 1.0 for examples it considers to be in the positive class and -1.0 for examples it considers anomalous. We can make labels for our data using

y = np.concatenate([np.repeat(1.0, len(Xpos)), np.repeat(-1.0, len(Xneg))])

We can then make a new generator of (train, test) splits with indices for the anomalous examples included in the test folds.

n, m = len(Xpos), len(Xneg)

splits = ((train, np.concatenate([test, np.arange(n, n + m)], axis=0))
          for train, test in splits)

You can then pass these splits to GridSearchCV along with the data X, y and whatever scoring method and other parameters you wish:

grid_search = GridSearchCV(estimator, param_grid, cv=splits, scoring=...)
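
Putting the pieces together, a minimal end-to-end sketch might look like the following (the parameter grid, the number of folds, and the choice of f1 as the scoring method are illustrative assumptions, not part of the original answer):

import numpy as np
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.svm import OneClassSVM

# Xpos (n x p positive-class examples) and Xneg (m x p known anomalies)
# come from your own pipeline.
n, m = len(Xpos), len(Xneg)
X = np.concatenate([Xpos, Xneg], axis=0)
y = np.concatenate([np.repeat(1.0, n), np.repeat(-1.0, m)])

# Train only on positive examples; test on held-out positives plus all anomalies.
# Materialized as a list so the splits can be reused across grid-search runs.
splits = [(train, np.concatenate([test, np.arange(n, n + m)], axis=0))
          for train, test in KFold(n_splits=5).split(Xpos)]

# Assumed parameter grid; adjust to your own gamma/nu ranges.
param_grid = {"gamma": np.logspace(-9, 3, 13),
              "nu": np.linspace(0.01, 0.99, 99)}

grid_search = GridSearchCV(OneClassSVM(), param_grid, cv=splits, scoring="f1")
grid_search.fit(X, y)
print(grid_search.best_params_)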

Edit: I hadn’t noticed that this approach was suggested in the comments of the other answer by Vivek Kumar, and that the OP had rejected it because they didn’t believe it would work with their method of choosing the best parameters. I still prefer the approach I’ve described because GridSearchCV will automatically handle multiprocessing and provides exception handling and informative warning and error messages.

It is also flexible in the choice of scoring method. You can use multiple scoring methods by passing a dictionary mapping strings to scoring callables and even define custom scoring callables. This is described in the Scikit-learn documentation here. A bespoke method of choosing the best parameters could likely be implemented with a custom scoring function. All of the metrics used by the OP could be included using the dictionary approach described in the documentation.
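
As a sketch of the multi-metric option, all of the metrics from the question could be wrapped as named scorers (the refit="f1" choice is an assumption for illustration; with multiple metrics, refit must name the metric used to select best_params_):

from sklearn.metrics import (make_scorer, accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# roc_auc here is computed from hard label predictions, mirroring the
# question's roc_auc_score(y_true, y_pred) usage.
scoring = {"accuracy": make_scorer(accuracy_score),
           "precision": make_scorer(precision_score),
           "recall": make_scorer(recall_score),
           "f1": make_scorer(f1_score),
           "roc_auc": make_scorer(roc_auc_score)}

grid_search = GridSearchCV(OneClassSVM(), param_grid, cv=splits,
                           scoring=scoring, refit="f1")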

You can find a real world example here. I'll make a note to change the link when this gets merged into master.

answered Sep 26 '22 by Albert Steppi