Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Batch gradient descent with scikit learn (sklearn)

I'm playing with a one-vs-all Logistic Regression classifier using Scikit-Learn (sklearn). I have a large dataset that is too slow to run all at one go; also I would like to study the learning curve as the training proceeds.

I would like to use batch gradient descent to train my classifier in batches of, say, 500 samples. Is there some way of using sklearn to do this, or should I abandon sklearn and "roll my own"?

This is what I have so far:

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# xs are subsets of my training data, ys are ground truth for same; I have more 
# data available for further training and cross-validation:
xs.shape, ys.shape
# => ((500, 784), (500))
lr = OneVsRestClassifier(LogisticRegression())
lr.fit(xs, ys)
lr.predict(xs[0,:])
# => [ 1.]
ys[0]
# => 1.0

I.e. it correctly identifies a training sample (yes, I realize it would be better to evaluate it with new data -- this is just a quick smoke-test).

R.e. batch gradient descent: I haven't gotten as far as creating learning curves, but can one simply run fit repeatedly on subsequent subsets of the training data? Or is there some other function to train in batches? The documentation and Google are fairly silent on the matter. Thanks!

like image 871
JohnJ Avatar asked Feb 23 '13 03:02

JohnJ


1 Answers

What you want is not batch gradient descent, but stochastic gradient descent; batch learning means learning on the entire training set in one go, while what you describe is properly called minibatch learning. That's implemented in sklearn.linear_model.SGDClassifier, which fits a logistic regression model if you give it the option loss="log".

With SGDClassifier, like with LogisticRegression, there's no need to wrap the estimator in a OneVsRestClassifier -- both do one-vs-all training out of the box.

# you'll have to set a few other options to get good estimates,
# in particular n_iterations, but this should get you going
lr = SGDClassifier(loss="log")

Then, to train on minibatches, use the partial_fit method instead of fit. The first time around, you have to feed it a list of classes because not all classes may be present in each minibatch:

import numpy as np
classes = np.unique(["ham", "spam", "eggs"])

for xs, ys in minibatches:
    lr.partial_fit(xs, ys, classes=classes)

(Here, I'm passing classes for each minibatch, which isn't necessary but doesn't hurt either and makes the code shorter.)

like image 90
Fred Foo Avatar answered Oct 24 '22 23:10

Fred Foo