
Scikit-Learn GridSearch custom scoring function

Tags:

scikit-learn

I need to perform kernel PCA on a dataset of dimension (5000, 26421) to get a lower-dimensional representation. To choose the number of components k, I reduce the data, reconstruct it in the original space, and compute the mean squared error between the reconstructed and original data for different values of k.

I came across sklearn's grid search functionality and want to use it for this parameter estimation. Since kernel PCA has no score function, I have implemented a custom scoring function and am passing it to GridSearchCV.

from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV
import numpy as np
import math

def scorer(clf, X):
    Y1 = clf.inverse_transform(X)
    error = math.sqrt(np.mean((X - Y1)**2))
    return error

param_grid = [
    {'degree': [1, 10], 'kernel': ['poly'], 'n_components': [100, 400, 100]},
    {'gamma': [0.001, 0.0001], 'kernel': ['rbf'], 'n_components': [100, 400, 100]},
]

kpca = KernelPCA(fit_inverse_transform=True, n_jobs=30)
clf = GridSearchCV(estimator=kpca, param_grid=param_grid, scoring=scorer)
clf.fit(X)

However, it results in the error below:

/usr/lib64/python2.7/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X=array([[ 2.,  2.,  1., ...,  0.,  0.,  0.],
    ....,  0.,  1., ...,  0.,  0.,  0.]], dtype=float32), Y=array([[-0.05904257, -0.02796719,  0.00919842, ....        0.00148251, -0.00311711]], dtype=float32), precomputed=False, dtype=<type 'numpy.float32'>)
    117                              "for %d indexed." %
    118                              (X.shape[0], X.shape[1], Y.shape[0]))
    119     elif X.shape[1] != Y.shape[1]:
    120         raise ValueError("Incompatible dimension for X and Y matrices: "
    121                          "X.shape[1] == %d while Y.shape[1] == %d" % (
--> 122                              X.shape[1], Y.shape[1]))
        X.shape = (1667, 26421)
        Y.shape = (112, 100)
    123 
    124     return X, Y
    125 
    126 

ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 26421 while Y.shape[1] == 100

Can someone point out what exactly I am doing wrong?

asked Sep 13 '17 at 23:09 by user1683894




1 Answer

The signature of your scoring function does not match what make_scorer expects. A metric function only takes the true and predicted values, so this is how you declare your custom scoring function:

import numpy as np

def my_scorer(y_true, y_predicted):
    # Root mean squared error between the true and predicted values
    return np.sqrt(np.mean((y_true - y_predicted) ** 2))

Then you can use the make_scorer function in sklearn to pass it to GridSearchCV. Be sure to set the greater_is_better parameter accordingly:

Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object will sign-flip the outcome of the score_func.

I am assuming you are calculating an error, so this parameter should be set to False, since the lower the error, the better:

from sklearn.metrics import make_scorer
my_func = make_scorer(my_scorer, greater_is_better=False)
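The sign flip can be checked on a small example. The sketch below is only an illustration: the toy regression data, the LinearRegression model and the names rmse, X_demo and y_demo are assumptions, not part of the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer


def rmse(y_true, y_predicted):
    """Root mean squared error; lower is better."""
    return float(np.sqrt(np.mean((y_true - y_predicted) ** 2)))


# Hypothetical toy regression data, used only for illustration.
rng = np.random.RandomState(0)
X_demo = rng.rand(50, 3)
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(50)

model = LinearRegression().fit(X_demo, y_demo)

# Plain metric call: a positive error.
raw_error = rmse(y_demo, model.predict(X_demo))

# Scorer call: signature is (estimator, X, y); the result is sign-flipped.
neg_rmse_scorer = make_scorer(rmse, greater_is_better=False)
flipped = neg_rmse_scorer(model, X_demo, y_demo)

print(raw_error, flipped)
```

The scorer returns the negated error, so a search that maximizes the score ends up preferring the smallest error.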

Then you pass it to GridSearchCV:

GridSearchCV(estimator=my_clf, param_grid=param_grid, scoring=my_func)

Where my_clf is your classifier.
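Note that a scorer built with make_scorer calls the estimator's predict method, which KernelPCA does not have. As an alternative, GridSearchCV also accepts a plain callable with the signature (estimator, X, y). The traceback in the question shows inverse_transform receiving 26421-dimensional data while expecting 100 components, so the sketch below first maps the held-out data into the reduced space with transform. It returns the negative error because GridSearchCV maximizes the score. The function name, the toy data and the small grid are assumptions made for illustration, not the asker's setup.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV


def neg_reconstruction_rmse(estimator, X, y=None):
    """Negative RMSE between X and its reconstruction (higher is better)."""
    X_reduced = estimator.transform(X)            # original space -> k components
    X_back = estimator.inverse_transform(X_reduced)  # k components -> original space
    return -float(np.sqrt(np.mean((X - X_back) ** 2)))


# Hypothetical small dataset standing in for the (5000, 26421) matrix.
rng = np.random.RandomState(0)
X_small = rng.rand(60, 8)

# A deliberately tiny grid so the sketch runs quickly.
small_grid = {"kernel": ["rbf"], "gamma": [0.01, 0.1], "n_components": [2, 4]}

search = GridSearchCV(
    KernelPCA(fit_inverse_transform=True),
    param_grid=small_grid,
    scoring=neg_reconstruction_rmse,
    cv=3,
)
search.fit(X_small)  # no y: the scorer only needs X

print(search.best_params_, search.best_score_)
```

Because each fold is scored on data the estimator was not fitted on, the reported error is a held-out reconstruction error rather than a training one.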

One more thing: I don't think GridSearchCV on its own is exactly what you are looking for. It evaluates each parameter combination on train and test splits of the data, but here you only want to transform your input data. You need to use a Pipeline in sklearn. Look at the scikit-learn example of combining PCA and GridSearchCV.
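A minimal sketch of the Pipeline idea, assuming a downstream classification task with labels is available (the question does not say so). The synthetic data, the LogisticRegression step and the step names kpca and clf are all hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical labelled toy data, used only for illustration.
X_toy, y_toy = make_classification(
    n_samples=120, n_features=10, n_informative=5, random_state=0
)

# The reduction step and the final estimator are tuned together.
pipe = Pipeline([
    ("kpca", KernelPCA(kernel="rbf")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
pipe_grid = {
    "kpca__n_components": [2, 5],
    "kpca__gamma": [0.01, 0.1],
}

pipe_search = GridSearchCV(pipe, param_grid=pipe_grid, cv=3)
pipe_search.fit(X_toy, y_toy)

print(pipe_search.best_params_, pipe_search.best_score_)
```

Here the number of components is chosen by how well the final classifier performs on held-out folds (default accuracy), not by reconstruction error.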

answered Sep 27 '22 at 22:09 by Gambit1614