 

How to write a custom estimator in sklearn and use cross-validation on it?

I would like to check the prediction error of a new method through cross-validation. I would like to know whether I can pass my method to the cross-validation function of sklearn and, if so, how.

I would like something like sklearn.cross_validation(cv=10).mymethod.

I also need to know how to define mymethod: should it be a function, and what should its inputs and outputs be?

For example, we can consider as mymethod an implementation of the least squares estimator (of course not the one already in sklearn).

I found this tutorial link but it is not very clear to me.

In the documentation they use

>>> import numpy as np
>>> from sklearn import cross_validation
>>> from sklearn import datasets
>>> from sklearn import svm

>>> iris = datasets.load_iris()
>>> iris.data.shape, iris.target.shape
((150, 4), (150,))

>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_validation.cross_val_score(
...    clf, iris.data, iris.target, cv=5)
...
>>> scores

But the problem is that the estimator clf they use is built into sklearn. How should I define my own estimator so that I can pass it to the cross_validation.cross_val_score function?

So, for example, suppose a simple estimator that uses a linear model $y = X\beta$, where $\beta$ is estimated as X[1,:] + alpha and alpha is a parameter. How should I complete the code?

class my_estimator():
    def fit(X, y):
        beta = X[1,:] + alpha  # where can I pass alpha to the function?
        return beta
    def scorer(estimator, X, y):  # what should the scorer function compute?
        return ?????

With the following code I received an error:

class my_estimator():
    def fit(X, y, **kwargs):
        #alpha = kwargs['alpha']
        beta = X[1,:]  #+alpha
        return beta

>>> cv = cross_validation.cross_val_score(my_estimator, x, y, scoring="mean_squared_error")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\cross_validation.py", line 1152, in cross_val_score
    for train, test in cv)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\externals\joblib\parallel.py", line 516, in __call__
    for function, args, kwargs in iterable:
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\cross_validation.py", line 1152, in <genexpr>
    for train, test in cv)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\base.py", line 43, in clone
    % (repr(estimator), type(estimator)))
TypeError: Cannot clone object '<class __main__.my_estimator at 0x05ACACA8>' (type <type 'classobj'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.
>>>
Donbeo asked Dec 02 '13



People also ask

How do you cross validate with sklearn?

The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset.
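The code snippet quoted with this answer is cut off; a minimal sketch of the full call, based on the same iris example quoted earlier in the question (note this answer refers to the newer sklearn.model_selection module rather than the sklearn.cross_validation module used in the question):

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)            # any estimator with fit/predict works here
scores = cross_val_score(clf, iris.data, iris.target, cv=5)  # 5-fold cross-validation
print(scores)                                  # one score per fold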

What is estimator in cross-validation score?

Given an estimator, the cross-validation object and the input dataset, cross_val_score repeatedly splits the data into a training and a testing set, trains the estimator on the training set and computes the score on the testing set for each iteration of cross-validation.
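As a rough sketch of what that means under the hood (simplified: for classifiers cross_val_score actually uses stratified folds by default and handles scorers more generally):

import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = svm.SVC(kernel='linear', C=1)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf.fit(X[train_idx], y[train_idx])                 # train on the training fold
    scores.append(clf.score(X[test_idx], y[test_idx]))  # evaluate on the held-out fold
print(np.array(scores))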

Does GridSearchCV do cross-validation?

Yes, GridSearchCV performs cross-validation. If I understand the concept correctly, you want to keep part of your data set unseen by the model in order to test it, so you train your models on a training data set and test them on a separate testing data set.
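For instance, a small sketch of a grid search in which every parameter combination is scored with 5-fold cross-validation (the parameter values here are only illustrative):

from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

search = GridSearchCV(svm.SVC(), param_grid, cv=5)  # 5-fold CV for each candidate
search.fit(iris.data, iris.target)                  # refits the best candidate on all the data
print(search.best_params_, search.best_score_)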


1 Answer

The answer also lies in sklearn's documentation.

You need to define two things:

  • an estimator that implements a fit(X, y) method, where X is the matrix of inputs and y is the vector of outputs

  • a scorer: a function or callable object that can be called as scorer(estimator, X, y) and returns the score of the given model

Referring to your example: first of all, scorer shouldn't be a method of the estimator; it is a separate notion. Just create a callable:

def scorer(estimator, X, y):
    return ?????  # compute whatever you want; it's up to you to define
                  # what it means for the given estimator to be "good" or "bad"

Or, an even simpler solution: you can pass the string 'mean_squared_error' or 'accuracy' (the full list is available in this part of the documentation) to the cross_val_score function to use a predefined scorer.
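For example, reusing the SVC from the question with a predefined scorer selected by name (a sketch using the same sklearn 0.14-era cross_validation module as the rest of this thread):

from sklearn import cross_validation, datasets, svm

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
# scoring='accuracy' picks one of the predefined scorers by its string name
scores = cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5,
                                          scoring='accuracy')
print(scores)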

Another possibility is to use the make_scorer factory function.
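A sketch of make_scorer wrapping an ordinary metric (greater_is_better=False because a smaller squared error means a better model, so the reported scores are negated by convention; the data here is made up purely for illustration):

import numpy as np
from sklearn import cross_validation, linear_model
from sklearn.metrics import make_scorer, mean_squared_error

mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

rng = np.random.RandomState(0)
X = rng.rand(20, 3)                          # toy regression data, illustration only
y = X.sum(axis=1) + 0.01 * rng.randn(20)

scores = cross_validation.cross_val_score(linear_model.LinearRegression(), X, y,
                                          cv=5, scoring=mse_scorer)
print(scores)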

As for the second thing, you can pass parameters to your model through the fit_params dict parameter of the cross_val_score function (as mentioned in the documentation). These parameters will be passed to the fit function.

class my_estimator():
    def fit(X, y, **kwargs):
        alpha = kwargs['alpha']
        beta = X[1,:] + alpha
        return beta

After reading all the error messages, which give a quite clear idea of what's missing, here is a simple example:

import numpy as np
from sklearn.cross_validation import cross_val_score

class RegularizedRegressor:
    def __init__(self, l = 0.01):
        self.l = l

    def combine(self, inputs):
        # dot product of [1, inputs...] with the learned weights
        return sum([i*w for (i,w) in zip([1] + inputs, self.weights)])

    def predict(self, X):
        return [self.combine(x) for x in X]

    def classify(self, inputs):
        return np.sign(self.predict(inputs))  # np.sign, not the undefined bare sign

    def fit(self, X, y, **kwargs):
        self.l = kwargs['l']          # received through fit_params
        X = np.matrix(X)
        y = np.matrix(y)
        # ordinary least squares via the normal equations
        W = (X.transpose() * X).getI() * X.transpose() * y

        self.weights = [w[0] for w in W.tolist()]

    def get_params(self, deep = False):
        # required so that sklearn can clone the estimator
        return {'l':self.l}

X = np.matrix([[0, 0], [1, 0], [0, 1], [1, 1]])
y = np.matrix([0, 1, 1, 0]).transpose()

print cross_val_score(RegularizedRegressor(),
                      X,
                      y,
                      fit_params={'l':0.1},
                      scoring = 'mean_squared_error')
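As a side note, a sketch of an alternative to writing get_params by hand: inherit from sklearn.base.BaseEstimator, which supplies get_params/set_params automatically as long as every constructor argument is stored in an attribute of the same name. The toy rule below is the hypothetical beta = X[1,:] + alpha model from the question, not part of the original answer:

import numpy as np
from sklearn.base import BaseEstimator

class MyEstimator(BaseEstimator):
    def __init__(self, alpha=0.0):
        self.alpha = alpha               # picked up automatically by get_params()

    def fit(self, X, y):
        X = np.asarray(X)
        # toy rule from the question: beta = X[1, :] + alpha
        self.beta_ = X[1, :] + self.alpha
        return self                      # sklearn convention: fit returns self

    def predict(self, X):
        return np.asarray(X).dot(self.beta_)   # y = X * beta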
BartoszKP answered Sep 19 '22
