I would like to check the prediction error of a new method through cross-validation. I would like to know whether I can pass my method to the cross-validation function of sklearn, and if so, how.
I would like something like sklearn.cross_validation(cv=10).mymethod.
I also need to know how to define mymethod: should it be a function, and what inputs and outputs should it have? For example, we could take mymethod to be an implementation of the least squares estimator (of course not the one in sklearn).
I found this tutorial link but it is not very clear to me.
In the documentation they use:

>>> import numpy as np
>>> from sklearn import cross_validation
>>> from sklearn import datasets
>>> from sklearn import svm

>>> iris = datasets.load_iris()
>>> iris.data.shape, iris.target.shape
((150, 4), (150,))

>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_validation.cross_val_score(
...    clf, iris.data, iris.target, cv=5)
...
>>> scores
But the problem is that the estimator clf they use is built by a function from sklearn. How should I define my own estimator so that I can pass it to the cross_validation.cross_val_score function?
So, for example, suppose a simple estimator that uses a linear model $y=x\beta$, where beta is estimated as X[1,:]+alpha and alpha is a parameter. How should I complete the code?
class my_estimator():
    def fit(X, y):
        beta = X[1,:] + alpha  # where can I pass alpha to the function?
        return beta

    def scorer(estimator, X, y):  # what should the scorer function compute?
        return ?????
With the following code I received an error:
class my_estimator():
    def fit(X, y, **kwargs):
        #alpha = kwargs['alpha']
        beta = X[1,:]  #+alpha
        return beta
>>> cv = cross_validation.cross_val_score(my_estimator, x, y, scoring="mean_squared_error")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\cross_validation.py", line 1152, in cross_val_score
    for train, test in cv)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\externals\joblib\parallel.py", line 516, in __call__
    for function, args, kwargs in iterable:
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\cross_validation.py", line 1152, in <genexpr>
    for train, test in cv)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\base.py", line 43, in clone
    % (repr(estimator), type(estimator)))
TypeError: Cannot clone object '<class __main__.my_estimator at 0x05ACACA8>' (type <type 'classobj'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.
>>>
The answer also lies in sklearn's documentation.
You need to define two things:

- an estimator that implements the fit(X, y) function, X being the matrix with inputs and y being the vector of outputs
- a scorer function, or a callable object, that can be called as scorer(estimator, X, y) and returns the score of the given model
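As a rough sketch (the details are up to you), a minimal estimator satisfying the first point could look like the following; the alpha parameter is just the one from your question:

import numpy as np

class MyEstimator(object):
    def __init__(self, alpha=0.0):
        # hyper-parameters belong in __init__ so sklearn can clone the estimator
        self.alpha = alpha

    def fit(self, X, y):
        # toy "training" step from the question: beta = X[1,:] + alpha
        self.beta = X[1, :] + self.alpha
        return self  # sklearn expects fit to return the estimator itself

    def predict(self, X):
        # linear model y = X * beta
        return np.dot(X, self.beta)

    def get_params(self, deep=False):
        # required by sklearn's clone(); this is exactly what your error complains about
        return {'alpha': self.alpha}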
Referring to your example: first of all, scorer shouldn't be a method of the estimator, it's a different notion. Just create a callable:
def scorer(estimator, X, y):
    # compute whatever you want; it's up to you to define
    # what it means that the given estimator is "good" or "bad"
    return ?????
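For instance, a scorer built around mean squared error might look like the sketch below (just an illustration, not the only choice; sklearn maximizes the score, hence the minus sign):

import numpy as np

def mse_scorer(estimator, X, y):
    # negative mean squared error: higher (closer to zero) means a better fit
    y_pred = np.asarray(estimator.predict(X)).ravel()
    y_true = np.asarray(y).ravel()
    return -np.mean((y_true - y_pred) ** 2)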
Or an even simpler solution: you can pass a string such as 'mean_squared_error' or 'accuracy' (full list available in this part of the documentation) to the cross_val_score function to use a predefined scorer.
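For example, assuming from sklearn import cross_validation as in your snippet, and an estimator class as described above, a call could look roughly like:

# note: pass an instance of the estimator, not the class itself
scores = cross_validation.cross_val_score(my_estimator(), x, y,
                                          scoring='mean_squared_error', cv=10)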
Another possibility is to use the make_scorer factory function.
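A rough sketch of that approach (greater_is_better=False tells sklearn that a lower mean squared error is better):

from sklearn.metrics import make_scorer, mean_squared_error

mse = make_scorer(mean_squared_error, greater_is_better=False)
scores = cross_validation.cross_val_score(my_estimator(), x, y, scoring=mse, cv=10)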
As for the second thing, you can pass parameters to your model through the fit_params dict parameter of the cross_val_score function (as mentioned in the documentation). These parameters will be passed to the fit function:
class my_estimator():
    def fit(X, y, **kwargs):
        alpha = kwargs['alpha']
        beta = X[1,:] + alpha
        return beta
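The corresponding call would then look roughly like this (0.1 is just an arbitrary value for alpha; note that get_params is still needed, as the full example below shows):

scores = cross_validation.cross_val_score(my_estimator(), x, y,
                                          fit_params={'alpha': 0.1},
                                          scoring='mean_squared_error')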
After reading all the error messages, which give a fairly clear idea of what's missing, here is a simple example:
import numpy as np
from sklearn.cross_validation import cross_val_score

class RegularizedRegressor:
    def __init__(self, l = 0.01):
        self.l = l

    def combine(self, inputs):
        # weighted sum of [1] + inputs with the learned weights
        return sum([i*w for (i,w) in zip([1] + inputs, self.weights)])

    def predict(self, X):
        return [self.combine(x) for x in X]

    def classify(self, inputs):
        return np.sign(self.predict(inputs))

    def fit(self, X, y, **kwargs):
        self.l = kwargs['l']
        X = np.matrix(X)
        y = np.matrix(y)
        # ordinary least squares solution: W = (X'X)^-1 X'y
        W = (X.transpose() * X).getI() * X.transpose() * y
        self.weights = [w[0] for w in W.tolist()]

    def get_params(self, deep = False):
        # required so that sklearn can clone the estimator
        return {'l': self.l}

X = np.matrix([[0, 0], [1, 0], [0, 1], [1, 1]])
y = np.matrix([0, 1, 1, 0]).transpose()

print cross_val_score(RegularizedRegressor(),
                      X, y,
                      fit_params={'l': 0.1},
                      scoring = 'mean_squared_error')