I've created a custom hand-coded classifier that implements the standard sklearn classifier methods (fit(), predict() and predict_proba()). Can this be used directly with the sklearn utility GridSearchCV(), or are there additions that should be made?
EDIT 1: On cel's suggestion I tried applying it directly.
The first step was to add get_params and set_params as explained here. Sure enough, the complete cross-validation procedure did run, but it ended with the following error:
return self._fit(X, y, ParameterGrid(self.param_grid))
best_estimator.fit(X, y, **self.fit_params)
AttributeError: 'NoneType' object has no attribute 'fit'
EDIT 2: Adding the classifier code (it's a Theano-based logistic regression classifier):
import time

import numpy as np
import theano
import theano.tensor as T

# `Layer` and `shared_dataset` are helpers defined elsewhere in my code (not
# shown): `Layer` holds the W/b parameters plus an activation, and
# `shared_dataset` loads numpy arrays into Theano shared variables.

class LogisticRegression:
    """ Apply minibatch logistic regression

    :type n_in: int
    :param n_in: number of input units, the dimension of the space in
                 which the datapoints lie

    :type n_out: int
    :param n_out: number of output units, the dimension of the space in
                  which the labels lie
    """
    def __init__(self, n_in, n_out, batch_size=600, learning_rate=0.13,
                 iters=500, verbose=0):
        self.n_in = n_in
        self.n_out = n_out
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.iters = iters
        self.verbose = verbose
        self.single_layer = Layer(self.n_in, self.n_out, T.nnet.softmax)
        self.minibatch_count = 0

    def get_params(self, deep=True):
        return {"n_in": self.n_in, "n_out": self.n_out,
                "batch_size": self.batch_size,
                "learning_rate": self.learning_rate,
                "iters": self.iters, "verbose": self.verbose}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)

    def minibatch_trainer(self, data_x, data_y):
        n_batches = data_x.get_value(borrow=True).shape[0] / self.batch_size
        tensor_x = T.matrix('x')
        tensor_y = T.ivector('y')
        index = T.lscalar('index')
        cost = self.single_layer.negative_log_likelihood(tensor_x, tensor_y)
        # plain gradient-descent updates on the layer's weights and bias
        g_W = T.grad(cost, self.single_layer.W)
        g_b = T.grad(cost, self.single_layer.b)
        updates = [(self.single_layer.W, self.single_layer.W - g_W * self.learning_rate),
                   (self.single_layer.b, self.single_layer.b - g_b * self.learning_rate)]
        train_batch = theano.function(
            [index], [cost],
            updates=updates,
            givens={tensor_x: data_x[index * self.batch_size: (index + 1) * self.batch_size],
                    tensor_y: data_y[index * self.batch_size: (index + 1) * self.batch_size]})
        return np.mean([train_batch(i) for i in xrange(n_batches)])

    def fit(self, data_x, data_y):
        data_x, data_y = shared_dataset(data_x, data_y)
        start = time.clock()
        for iter in xrange(self.iters):
            train_err = self.minibatch_trainer(data_x, data_y)
            if self.verbose == 1:
                print "Iter %d --> %f" % (iter, train_err)
        end = time.clock()
        print "Finished Training Logistic Regression Model\n" \
              "Iterations %d\n" \
              "Time Taken : %d secs" % (self.iters, end - start)
        return self

    def partial_fit(self, data_x, data_y):
        data_x, data_y = shared_dataset(data_x, data_y)
        self.minibatch_count += 1
        err = self.minibatch_trainer(data_x, data_y)
        print "MiniBatch %d --> %f" % (self.minibatch_count, err)

    def predict(self, data_x):
        data_x = shared_dataset(data_x)
        n_batches = data_x.get_value(borrow=True).shape[0] / self.batch_size
        tensor_x = T.matrix('x')
        index = T.lscalar('index')
        tensor_ypred = self.prediction_tensor(tensor_x)
        predictor = theano.function(
            [index], tensor_ypred,
            givens={tensor_x: data_x[index * self.batch_size: (index + 1) * self.batch_size]})
        ypred = [predictor(i) for i in xrange(n_batches)]
        return np.hstack(ypred)

    def predict_proba(self, data_x):
        data_x = shared_dataset(data_x)
        tensor_x = T.matrix('x')
        tensor_ypredproba = self.single_layer.decision_function_tensor(tensor_x)
        predproba_func = theano.function([], tensor_ypredproba,
                                         givens={tensor_x: data_x})
        return predproba_func()

    def prediction_tensor(self, tensor_x):
        """
        Returns the predicted y value as a tensor variable

        :param tensor_x: TensorType matrix of input data
        :return: TensorType tensor_ypred output
        """
        return T.argmax(self.single_layer.decision_function_tensor(tensor_x), axis=1)
EDIT 3: Adding the exact usage of GridSearchCV:

clf_cv = GridSearchCV(LogisticRegression(n_in=200, n_out=2), {"iters": [3]},
                      cv=4, scoring="roc_auc", n_jobs=-1, verbose=1)

I've also tried adding BaseEstimator and ClassifierMixin; sklearn.base.clone does not raise any errors.
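A minimal version of that check looks roughly like the following (a sketch; clone rebuilds the estimator from get_params(), so it fails loudly if get_params and __init__ disagree):

from sklearn.base import clone

# clone() constructs a fresh LogisticRegression(**get_params()) and verifies
# the parameters round-trip; it never calls set_params, which is why it can
# succeed even though GridSearchCV later fails
clone(LogisticRegression(n_in=200, n_out=2))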
GridSearchCV performs an exhaustive search over a grid of parameter values, scoring each candidate combination by cross-validation. You pass it the model and the parameter grid; after the search it refits the estimator with the best parameter values found, and that refitted estimator is used for predictions.
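A rough sketch of that workflow (illustrative only, using sklearn's built-in SVC; the import path is sklearn.model_selection on newer versions, sklearn.grid_search on the old versions matching the traceback above):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# exhaustive search over a 3x2 grid; each candidate is scored by 4-fold CV
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}, cv=4)
search.fit(X, y)

print(search.best_params_)  # best parameter combination found
print(search.best_score_)   # its mean cross-validated score
ypred = search.predict(X)   # delegates to the estimator refit with best_params_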
Per the documentation this may need extra memory if the dataset is big, and you may have to use the pre_dispatch parameter. Note also that each parameter combination is refit once per CV fold, so the folds multiply the runtime: I have 3 parameters with 10 levels each to scan and a single run takes about 19 seconds, so 10*3*19 = 570 s ≈ 10 minutes of fitting, but with, say, 4 folds that becomes roughly 570*4 ≈ 38 minutes, which would explain why I definitely have to wait about 35-45 minutes.
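For instance, a sketch reusing the constructor call from the question (the pre_dispatch value here is illustrative):

# pre_dispatch caps how many jobs are dispatched at once, and hence how many
# copies of the data can be in memory simultaneously; the default is '2*n_jobs'
clf_cv = GridSearchCV(LogisticRegression(n_in=200, n_out=2), {"iters": [3]},
                      cv=4, scoring="roc_auc", n_jobs=-1, verbose=1,
                      pre_dispatch='n_jobs')  # dispatch fewer jobs to save memory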
Judging by the documentation, if you pass an integer for cv, GridSearchCV already uses StratifiedKFold in some cases: "For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used."
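In other words (a sketch against the newer sklearn.model_selection API, assuming a classifier and binary/multiclass y), the integer form is shorthand for an explicit StratifiedKFold:

from sklearn.linear_model import LogisticRegression as SkLogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

clf = SkLogisticRegression()
param_grid = {"C": [0.1, 1, 10]}

# for a classifier with binary/multiclass y these set up the same splits
gs_int = GridSearchCV(clf, param_grid, cv=4)
gs_skf = GridSearchCV(clf, param_grid, cv=StratifiedKFold(n_splits=4))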
Had the same problem a couple of minutes ago. The documentation is incorrect: you have to change set_params to return self. Internally GridSearchCV builds the final model as clone(estimator).set_params(**best_parameters) and then calls best_estimator.fit(X, y, **self.fit_params), so when set_params returns None the chained call leaves best_estimator as None, which is exactly the AttributeError above:
def set_params(self, **parameters):
    for parameter, value in parameters.items():
        setattr(self, parameter, value)
    return self
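A cleaner alternative (my sketch, not part of the original answer): inherit from sklearn.base.BaseEstimator, which derives get_params() from the __init__ signature and provides a set_params() that correctly returns self, so neither method needs to be hand-coded:

from sklearn.base import BaseEstimator, ClassifierMixin

class LogisticRegression(BaseEstimator, ClassifierMixin):
    def __init__(self, n_in=None, n_out=None, batch_size=600,
                 learning_rate=0.13, iters=500, verbose=0):
        # BaseEstimator expects __init__ to store each argument verbatim
        # under the same name; defer any real work to fit()
        self.n_in = n_in
        self.n_out = n_out
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.iters = iters
        self.verbose = verbose

    # fit / predict / predict_proba as in the question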