Is there a way to use GridSearchCV or any other built-in sklearn function to find the best hyper-parameters for OneClassSVM classifier?
What I currently do, is perform the search myself using train/test split like this:
Gamma and nu values are defined as:
gammas = np.logspace(-9, 3, 13)
nus = np.linspace(0.01, 0.99, 99)
The loop that explores every hyper-parameter combination and records the scores:
from sklearn.svm import OneClassSVM
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

clf = OneClassSVM()
results = []

train_x = vectorizer.fit_transform(train_contents)
test_x = vectorizer.transform(test_contents)

for gamma in gammas:
    for nu in nus:
        clf.set_params(gamma=gamma, nu=nu)
        clf.fit(train_x)
        y_pred = clf.predict(test_x)
        if 1. in y_pred:  # Check if at least one review is predicted to be in the class
            results.append(((gamma, nu), (accuracy_score(y_true, y_pred),
                                          precision_score(y_true, y_pred),
                                          recall_score(y_true, y_pred),
                                          f1_score(y_true, y_pred),
                                          roc_auc_score(y_true, y_pred))))

# Determine and print the best parameter settings and their performance
print_best_parameters(results, best_parameters(results))
Results are stored in a list of tuples of the form:
((gamma, nu), (accuracy_score, precision_score, recall_score, f1_score, roc_auc_score))
To find the best accuracy, f1, roc_auc scores and parameters I wrote my own function:
best_parameters(results)
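The original best_parameters is not shown in the question, so here is a minimal hypothetical sketch of what such a helper might look like, assuming results has the structure above (the scores in the example are made up):

```python
# Hypothetical sketch of best_parameters: for each metric position in the
# score tuple, pick the (gamma, nu) pair with the highest score.
METRICS = ("accuracy", "precision", "recall", "f1", "roc_auc")

def best_parameters(results):
    """Return {metric_name: ((gamma, nu), score)} for the best score per metric."""
    best = {}
    for i, name in enumerate(METRICS):
        params, scores = max(results, key=lambda r: r[1][i])
        best[name] = (params, scores[i])
    return best

# Example with made-up scores:
results = [((0.1, 0.5), (0.80, 0.75, 0.70, 0.72, 0.78)),
           ((1.0, 0.1), (0.85, 0.70, 0.90, 0.79, 0.83))]
print(best_parameters(results)["f1"])  # ((1.0, 0.1), 0.79)
```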
I ran into this same problem and found this question while searching for a solution. I ended up finding a solution that uses GridSearchCV
and am leaving this answer for anyone else who searches and finds this question.
The cv parameter of the GridSearchCV class can take as its input an iterable yielding (train, test) splits as arrays of indices. You can generate splits that use only data from the positive class in the training folds, and the remaining data in the positive class plus all data in the negative class in the testing folds.
You can use sklearn.model_selection.KFold to make the splits:
from sklearn.model_selection import KFold
Suppose Xpos is an n x p numpy array of data for the positive class for the OneClassSVM, and Xneg is an m x p array of data for known anomalous examples. You can first generate splits for Xpos using:
splits = KFold(n_splits=5).split(Xpos)
This will construct a generator of tuples of the form (train, test), where train is a numpy array of ints containing indices for the examples in a training fold and test is a numpy array containing indices for the examples in a test fold.
You can then combine Xpos and Xneg into a single dataset using:
X = np.concatenate([Xpos, Xneg], axis=0)
The OneClassSVM will predict 1.0 for examples it thinks are in the positive class and -1.0 for examples it thinks are anomalous. We can make labels for our data using:
y = np.concatenate([np.repeat(1.0, len(Xpos)), np.repeat(-1.0, len(Xneg))])
We can then make a new generator of (train, test) splits, with the indices of the anomalous examples included in every test fold:
n, m = len(Xpos), len(Xneg)
splits = ((train, np.concatenate([test, np.arange(n, n + m)], axis=0))
          for train, test in splits)
You can then pass these splits to GridSearchCV along with the data X, y and whatever scoring method and other parameters you wish:
grid_search = GridSearchCV(estimator, param_grid, cv=splits, scoring=...)
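Putting the pieces together, here is a minimal runnable sketch of the whole approach, using synthetic numeric data in place of the OP's vectorized reviews (the shapes, parameter grid, and scoring choice are illustrative, not recommendations):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
Xpos = rng.normal(0, 1, size=(100, 5))   # positive (normal) examples
Xneg = rng.normal(5, 1, size=(20, 5))    # known anomalous examples

X = np.concatenate([Xpos, Xneg], axis=0)
y = np.concatenate([np.repeat(1.0, len(Xpos)), np.repeat(-1.0, len(Xneg))])

# Train folds contain only positive examples; each test fold gets the
# held-out positives plus all of the anomalous examples.
n, m = len(Xpos), len(Xneg)
splits = ((train, np.concatenate([test, np.arange(n, n + m)], axis=0))
          for train, test in KFold(n_splits=5).split(Xpos))

param_grid = {"gamma": np.logspace(-3, 1, 5), "nu": [0.05, 0.1, 0.2]}
grid_search = GridSearchCV(OneClassSVM(), param_grid, cv=splits, scoring="f1")
grid_search.fit(X, y)
print(grid_search.best_params_)
```

Note that the generator of splits can only be consumed once, so it must be rebuilt before any subsequent fit call.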
Edit: I hadn’t noticed that this approach was suggested in the comments of the other answer by Vivek Kumar, and that the OP had rejected it because they didn’t believe it would work with their method of choosing the best parameters. I still prefer the approach I’ve described because GridSearchCV will automatically handle multiprocessing and provides exception handling and informative warning and error messages.
It is also flexible in the choice of scoring method. You can use multiple scoring methods by passing a dictionary mapping strings to scoring callables and even define custom scoring callables. This is described in the Scikit-learn documentation here. A bespoke method of choosing the best parameters could likely be implemented with a custom scoring function. All of the metrics used by the OP could be included using the dictionary approach described in the documentation.
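As a sketch of that dictionary approach (again with synthetic placeholder data; with multiple metrics, refit must name the metric used to select best_params_):

```python
import numpy as np
from sklearn.metrics import make_scorer, accuracy_score, precision_score, f1_score
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
Xpos = rng.normal(0, 1, size=(60, 4))
Xneg = rng.normal(4, 1, size=(15, 4))
X = np.concatenate([Xpos, Xneg], axis=0)
y = np.concatenate([np.repeat(1.0, len(Xpos)), np.repeat(-1.0, len(Xneg))])

n, m = len(Xpos), len(Xneg)
splits = ((train, np.concatenate([test, np.arange(n, n + m)]))
          for train, test in KFold(n_splits=3).split(Xpos))

# Multiple metrics via a dict of scorers; refit picks which one
# determines best_params_ and the refitted estimator.
scoring = {
    "accuracy": make_scorer(accuracy_score),
    "precision": make_scorer(precision_score),
    "f1": make_scorer(f1_score),
}
grid_search = GridSearchCV(OneClassSVM(),
                           {"gamma": [0.01, 0.1, 1.0], "nu": [0.1, 0.3]},
                           cv=splits, scoring=scoring, refit="f1")
grid_search.fit(X, y)
# Per-metric results appear in grid_search.cv_results_ under keys like
# "mean_test_accuracy", "mean_test_precision", "mean_test_f1".
```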
You can find a real world example here. I'll make a note to change the link when this gets merged into master.