
Can GridSearchCV be used with a custom classifier?

I've created a custom hand-coded classifier that implements the standard sklearn classifier methods (fit(), predict() and predict_proba()). Can it be used directly with the sklearn utility GridSearchCV(), or does it need anything added first?

EDIT 1: On cel's suggestion, I tried applying it directly.

The first step was to add get_params and set_params as explained here. Sure enough, the complete cross-validation procedure ran, but it ended with the following error:

return self._fit(X, y, ParameterGrid(self.param_grid))
best_estimator.fit(X, y, **self.fit_params)
AttributeError: 'NoneType' object has no attribute 'fit'

EDIT 2: Adding the classifier code (it's a Theano-based logistic regression classifier)

class LogisticRegression:
    """ Apply minibatch logistic regression

    :type n_in: int
    :param n_in: number of input units, the dimension of the space in
                 which the datapoints lie

    :type n_out: int
    :param n_out: number of output units, the dimension of the space in
                  which the labels lie

    """

    def __init__(self,n_in,n_out,batch_size=600,learning_rate=0.13,iters=500,verbose=0):
        self.n_in = n_in
        self.n_out = n_out
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.iters = iters
        self.verbose = verbose
        self.single_layer = Layer(self.n_in,self.n_out,T.nnet.softmax)
        self.minibatch_count = 0

    def get_params(self,deep=True):
        return {"n_in" : self.n_in,"n_out" : self.n_out,"batch_size" : self.batch_size,
                "learning_rate" : self.learning_rate,"iters" : self.iters,
                "verbose" : self.verbose}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)

    def minibatch_trainer(self,data_x,data_y):
        n_batches = data_x.get_value(borrow=True).shape[0]/self.batch_size
        tensor_x = T.matrix('x')
        tensor_y = T.ivector('y')
        index = T.lscalar('index')
        cost = self.single_layer.negative_log_likelihood(tensor_x, tensor_y)
        g_W = T.grad(cost,self.single_layer.W)
        g_b = T.grad(cost,self.single_layer.b)
        updates = [(self.single_layer.W,self.single_layer.W - g_W*self.learning_rate),
                    (self.single_layer.b,self.single_layer.b - g_b*self.learning_rate)]
        train_batch = theano.function([index],[cost],
                                      updates=updates,
                                      givens={tensor_x : data_x[index*self.batch_size : (index + 1)*self.batch_size],
                                              tensor_y : data_y[index*self.batch_size : (index + 1)*self.batch_size]})
        return np.mean([train_batch(i) for i in xrange(n_batches)])

    def fit(self,data_x,data_y):
        data_x,data_y = shared_dataset(data_x,data_y)
        start = time.clock()
        for iter in xrange(self.iters):
            train_err = self.minibatch_trainer(data_x,data_y)
            if self.verbose==1: print "Iter %d --> %f" % (iter,train_err)
        end = time.clock()
        print "Finished Training Logistic Regression Model\n" \
              "Iterations %d\n" \
              "Time Taken : %d secs" % (self.iters,end - start)
        return self

    def partial_fit(self,data_x,data_y):
        data_x,data_y = shared_dataset(data_x,data_y)
        self.minibatch_count += 1
        err = self.minibatch_trainer(data_x, data_y)
        print "MiniBatch %d --> %f" % (self.minibatch_count,err)

    def predict(self,data_x):
        data_x = shared_dataset(data_x)
        n_batches = data_x.get_value(borrow=True).shape[0]/self.batch_size
        tensor_x = T.matrix('x')
        index = T.lscalar('index')
        tensor_ypred = self.prediction_tensor(tensor_x)
        predictor = theano.function([index],tensor_ypred,
                                    givens={tensor_x : data_x[index*self.batch_size:(index + 1)*self.batch_size]})
        ypred = [predictor(i) for i in xrange(n_batches)]
        return np.hstack(ypred)

    def predict_proba(self,data_x):
        data_x = shared_dataset(data_x)
        tensor_x = T.matrix('x')
        tensor_ypredproba = self.single_layer.decision_function_tensor(tensor_x)
        predproba_func = theano.function([],tensor_ypredproba,
                                           givens={tensor_x : data_x})
        return predproba_func()

    def prediction_tensor(self,tensor_x):
        """
        Returns the predicted y value as a tensor variable
        :param tensor_x: TensorType matrix on input data
        :return: TensorType tensor_ypred output
        """
        return T.argmax(self.single_layer.decision_function_tensor(tensor_x),axis=1)

EDIT 3: Adding exact usage of GridSearchCV

clf_cv = GridSearchCV(LogisticRegression(n_in=200,n_out=2),{"iters" : [3]},cv=4,scoring="roc_auc",n_jobs=-1,verbose=1)

I've also tried inheriting from BaseEstimator and ClassifierMixin; sklearn.base.clone does not raise any errors.

tangy asked Jan 24 '15 10:01




1 Answer

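I ran into the same problem a few minutes ago. The documentation is incorrect on this point: set_params has to return self. Without the return statement the call evaluates to None, which would explain the "'NoneType' object has no attribute 'fit'" error when the best estimator is fitted. Change set_params to: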

def set_params(self, **parameters):
    for parameter, value in parameters.items():
        setattr(self, parameter, value)
    return self
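
To see why the missing return matters, here is a minimal pure-Python sketch. It assumes that GridSearchCV rebuilds the best estimator with a chained call of the form clone(estimator).set_params(**best_params) and then calls fit on the result, which is consistent with the traceback in the question but is an assumption about the scikit-learn version in use. BrokenEstimator, FixedEstimator and rebuild_best are hypothetical names, and copy.deepcopy stands in for sklearn.base.clone.

```python
import copy


class BrokenEstimator:
    """Hypothetical estimator whose set_params forgets to return self."""

    def __init__(self, iters=500):
        self.iters = iters

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        # No return statement here, so the call evaluates to None.

    def fit(self, X, y):
        return self


class FixedEstimator(BrokenEstimator):
    """Same estimator, but set_params returns self so calls can be chained."""

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self


def rebuild_best(estimator, best_params):
    # Assumed call pattern: copy the estimator, then chain set_params.
    # copy.deepcopy is a stand-in for sklearn.base.clone.
    return copy.deepcopy(estimator).set_params(**best_params)


broken = rebuild_best(BrokenEstimator(), {"iters": 3})
fixed = rebuild_best(FixedEstimator(), {"iters": 3})

error_message = None
try:
    # broken is None, so this raises the same kind of error as in the question.
    broken.fit([[0.0]], [0])
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
print(fixed.iters)
```

With the fixed version the chained call hands back the estimator itself, so the later fit call has an object to work on.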
memecs answered Oct 31 '22 18:10
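
The question also mentions trying BaseEstimator and ClassifierMixin. As a rough sketch of that route: inheriting from BaseEstimator supplies get_params and set_params (the inherited set_params returns self), as long as __init__ only stores its arguments under the same names. ThresholdClassifier and the synthetic data below are hypothetical, and the import path assumes a recent scikit-learn where GridSearchCV lives in sklearn.model_selection; older releases, such as the one in the question, exposed it from sklearn.grid_search.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV


class ThresholdClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical toy classifier: predicts 1 when feature 0 exceeds a threshold."""

    def __init__(self, threshold=0.0):
        # Store constructor arguments unchanged so the inherited
        # get_params/set_params can discover and update them.
        self.threshold = threshold

    def fit(self, X, y):
        # Nothing is learned here; only the class labels are recorded.
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        X = np.asarray(X)
        return (X[:, 0] > self.threshold).astype(int)


# Synthetic data whose labels follow the rule "feature 0 > 0.5".
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] > 0.5).astype(int)

# Default scoring uses ClassifierMixin.score, i.e. accuracy.
search = GridSearchCV(ThresholdClassifier(), {"threshold": [-0.5, 0.0, 0.5]}, cv=4)
search.fit(X, y)
print(search.best_params_)
```

No get_params or set_params is written by hand here, so there is no return value to forget.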