I need to perform kernel PCA on a dataset of dimension (5000, 26421) to get a lower-dimensional representation. To choose the number of components (say k), I reduce the data, reconstruct it in the original space, and compute the mean squared error between the reconstructed and original data for different values of k.
I came across sklearn's grid search functionality and want to use it for the above parameter estimation. Since there is no score function for kernel PCA, I have implemented a custom scoring function and am passing it to GridSearchCV.
from sklearn.decomposition.kernel_pca import KernelPCA
from sklearn.model_selection import GridSearchCV
import numpy as np
import math
def scorer(clf, X):
    Y1 = clf.inverse_transform(X)
    error = math.sqrt(np.mean((X - Y1)**2))
    return error
param_grid = [
    {'degree': [1, 10], 'kernel': ['poly'], 'n_components': [100, 400, 100]},
    {'gamma': [0.001, 0.0001], 'kernel': ['rbf'], 'n_components': [100, 400, 100]},
]
kpca = KernelPCA(fit_inverse_transform=True, n_jobs=30)
clf = GridSearchCV(estimator=kpca, param_grid=param_grid, scoring=scorer)
clf.fit(X)
However, it results in the following error:
/usr/lib64/python2.7/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X=array([[ 2., 2., 1., ..., 0., 0., 0.],
...., 0., 1., ..., 0., 0., 0.]], dtype=float32), Y=array([[-0.05904257, -0.02796719, 0.00919842, .... 0.00148251, -0.00311711]], dtype=float32), precomputed=False, dtype=<type 'numpy.float32'>)
117 "for %d indexed." %
118 (X.shape[0], X.shape[1], Y.shape[0]))
119 elif X.shape[1] != Y.shape[1]:
120 raise ValueError("Incompatible dimension for X and Y matrices: "
121 "X.shape[1] == %d while Y.shape[1] == %d" % (
--> 122 X.shape[1], Y.shape[1]))
X.shape = (1667, 26421)
Y.shape = (112, 100)
123
124 return X, Y
125
126
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 26421 while Y.shape[1] == 100
Can someone point out what exactly I am doing wrong?
Most supervised estimators in Scikit-learn have a score method that can be called after training on the data, usually X_train, y_train. When you call score on classifiers like LogisticRegression, RandomForestClassifier, etc., the method computes the accuracy by default (accuracy is #correct_preds / #all_preds). KernelPCA does not define a score method, which is why a custom scoring function is needed here.
GridSearchCV tries all the combinations of the values passed in the dictionary and evaluates the model for each combination using the Cross-Validation method. Hence after using this function we get accuracy/loss for every combination of hyperparameters and we can choose the one with the best performance.
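To see which combinations would be tried, the grid can be expanded with ParameterGrid. The sketch below reuses the param_grid from the question as-is; the variable names candidates, n_candidates and first are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Grid copied from the question. Note that 100 appears twice under
# 'n_components', so that value is evaluated twice per setting.
param_grid = [
    {'degree': [1, 10], 'kernel': ['poly'], 'n_components': [100, 400, 100]},
    {'gamma': [0.001, 0.0001], 'kernel': ['rbf'], 'n_components': [100, 400, 100]},
]

# Each entry is a plain dict of keyword arguments for the estimator.
candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)  # 2*1*3 combinations per dict, two dicts
first = candidates[0]
```

Each of these candidates is then fitted once per cross-validation fold, so the total number of fits is the number of candidates times the number of folds.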
param_grid: a dictionary (or list of dictionaries) with parameter names as keys and lists of parameter values to try.
scoring: the performance measure. For example, 'r2' for regression models, 'precision' for classification models.
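Besides a string, scoring can also be a callable with the signature scorer(estimator, X, y). The traceback in the question is consistent with inverse_transform being given data that is still in the original 26421-dimensional space, whereas it expects data with n_components columns. Below is a minimal sketch of a reconstruction-error scorer that transforms first and then inverts. It assumes a small random matrix in place of the real data; the function name neg_reconstruction_rmse and the grid values are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV


def neg_reconstruction_rmse(estimator, X, y=None):
    """Negative RMSE between X and its kernel PCA reconstruction."""
    X_reduced = estimator.transform(X)               # (n_samples, n_components)
    X_back = estimator.inverse_transform(X_reduced)  # (n_samples, n_features)
    # GridSearchCV maximizes the score, so the error is negated.
    return -float(np.sqrt(np.mean((X - X_back) ** 2)))


# Small synthetic stand-in for the real (5000, 26421) data (assumption).
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 8))

# Illustrative grid, not the one from the question.
param_grid = {'kernel': ['rbf'], 'gamma': [0.01, 0.1], 'n_components': [2, 4]}
kpca = KernelPCA(fit_inverse_transform=True)
search = GridSearchCV(kpca, param_grid, scoring=neg_reconstruction_rmse, cv=3)
search.fit(X)
```

After fitting, search.best_params_ holds the combination with the lowest held-out reconstruction error, and search.best_score_ is that error with its sign flipped.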
The syntax of the scoring function is incorrect for use with make_scorer: a metric function only takes the true and predicted values. This is how you declare your custom scoring function:
def my_scorer(y_true, y_predicted):
    error = math.sqrt(np.mean((y_true - y_predicted)**2))
    return error
Then you can use the make_scorer function in Sklearn to pass it to GridSearchCV. Be sure to set the greater_is_better attribute accordingly:
"Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object will sign-flip the outcome of the score_func."
I am assuming you are calculating an error, so this attribute should be set to False, since the lower the error, the better:
from sklearn.metrics import make_scorer
my_func = make_scorer(my_scorer, greater_is_better=False)
Then pass it to GridSearchCV:
GridSearchCV(estimator=my_clf, param_grid=param_grid, scoring=my_func)
where my_clf is your classifier.
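The sign flip can be checked on a toy problem. The sketch below assumes a LinearRegression model on synthetic data, chosen only because a scorer built with make_scorer calls predict on the estimator by default (KernelPCA has no predict method, so this form applies to estimators that do). The names rmse, loss_scorer, raw_error and score are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer


def rmse(y_true, y_predicted):
    """Root mean squared error: lower is better."""
    return float(np.sqrt(np.mean((y_true - y_predicted) ** 2)))


# Synthetic regression data (assumption, for illustration only).
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)
loss_scorer = make_scorer(rmse, greater_is_better=False)

raw_error = rmse(y, model.predict(X))  # positive error value
score = loss_scorer(model, X, y)       # same value with the sign flipped
```

Because the scorer returns the negated error, GridSearchCV can keep maximizing the score while still selecting the parameters with the smallest error.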
One more thing: I don't think GridSearchCV on its own is exactly what you are looking for. It evaluates each parameter combination on train and test splits, but here you only want to transform your input data. You need to use Pipeline in Sklearn. Look at the example mentioned here of combining PCA and GridSearchCV.
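A rough sketch of that Pipeline approach follows. It assumes class labels y are available for a downstream classifier, which the question does not mention; the synthetic data, the step names 'kpca' and 'clf', and the grid values are all illustrative.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data with binary labels (assumption: a supervised task exists).
rng = np.random.RandomState(0)
X = rng.normal(size=(90, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Dimensionality reduction followed by a classifier.
pipe = Pipeline([
    ('kpca', KernelPCA(kernel='rbf')),
    ('clf', LogisticRegression()),
])

# Step parameters are addressed as <step name>__<parameter name>.
param_grid = {'kpca__n_components': [2, 4], 'kpca__gamma': [0.01, 0.1]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
```

Here the number of components is chosen by the cross-validated accuracy of the classifier rather than by reconstruction error, so it answers a slightly different question than the original scorer.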