 

How can I combine training-set-specific learned parameters with sklearn online (out-of-core) learning?

My dataset is getting too large to fit in memory, and I'm looking for online learning solutions in sklearn, which the documentation refers to as out-of-core learning.

They offer some classes with a partial_fit API, which essentially lets you keep a subset of your data in memory and operate on it one batch at a time. However, many preprocessing stages (such as data scaling) learn parameters during their fit stage on the training data, and those parameters are then used for transformations.
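For reference, a minimal sketch of how the partial_fit API is typically used; iter_chunks() here is a hypothetical generator standing in for however you stream your data from disk:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier()
    all_classes = np.array([0, 1])  # partial_fit needs every class declared up front

    # iter_chunks() is a hypothetical generator yielding (X, y) batches
    for X_chunk, y_chunk in iter_chunks():
        clf.partial_fit(X_chunk, y_chunk, classes=all_classes)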

For example, if you use a min-max scaler to bound features to [-1, 1], or standardize your data, the parameters the scaler learns, and eventually uses to transform data, come from whatever subset of the training data it happens to be operating on in a given iteration.
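To make this concrete, here is a small demonstration (with made-up numbers) of two chunks yielding different scaler parameters, and hence different transforms for the same value:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    chunk_a = np.array([[0.0], [5.0]])
    chunk_b = np.array([[0.0], [100.0]])

    scaler_a = MinMaxScaler(feature_range=(-1, 1)).fit(chunk_a)
    scaler_b = MinMaxScaler(feature_range=(-1, 1)).fit(chunk_b)

    x = np.array([[5.0]])
    print(scaler_a.transform(x))  # [[ 1. ]]  -- 5 is the max of chunk_a
    print(scaler_b.transform(x))  # [[-0.9]]  -- 5 is near the min of chunk_b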

This means the parameters learned during the fit stage on one subset of the training data can differ from those learned on another subset, since they are training-set specific. And therein lies the heart of my question:

How can you combine parameters learned during the fit stage of a preprocessing step when using online/out-of-core learning, when the learned parameters are a function of the training data?

asked Jan 23 '15 by trianta2


1 Answer

You can fit the StandardScaler instance on a subset large enough to fit in RAM at once (say a couple of GB of data) and then re-use that same fixed scaler instance to transform the rest of the data one batch at a time. A couple of thousand samples should already give a good estimate of each feature's mean and standard deviation, so there is no need to fit on the full dataset just for the scaler.
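A minimal sketch of that approach; load_subsample() and iter_chunks() are hypothetical placeholders for your own data loading:

    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import SGDRegressor

    # Fit the scaler once on a subsample that fits in RAM...
    scaler = StandardScaler().fit(load_subsample())

    # ...then reuse the same frozen scaler for every batch.
    model = SGDRegressor()
    for X_chunk, y_chunk in iter_chunks():
        model.partial_fit(scaler.transform(X_chunk), y_chunk)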

It would still be nice to add a partial_fit method to the StandardScaler class, implementing streaming mean & variance estimation for completeness.
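Such a method could be built on the standard streaming update for combining per-batch means and variances (Chan et al.'s pairwise combination of (n, mean, M2) statistics); a rough sketch of the idea, not actual scikit-learn code:

    import numpy as np

    class StreamingStats:
        """Running per-feature mean and variance, merged one batch at a time."""

        def __init__(self):
            self.n = 0
            self.mean = 0.0
            self.m2 = 0.0  # sum of squared deviations from the running mean

        def partial_fit(self, X):
            X = np.asarray(X, dtype=float)
            n_b = X.shape[0]
            mean_b = X.mean(axis=0)
            m2_b = ((X - mean_b) ** 2).sum(axis=0)
            delta = mean_b - self.mean
            n_total = self.n + n_b
            # Chan et al. combination of batch statistics with running statistics
            self.mean = self.mean + delta * n_b / n_total
            self.m2 = self.m2 + m2_b + delta ** 2 * self.n * n_b / n_total
            self.n = n_total
            return self

        @property
        def var(self):
            # assumes at least one batch has been seen
            return self.m2 / self.n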

But even if StandardScaler had a partial_fit method, you would still need to make several passes over the data (and optionally store the preprocessed data on disk for later reuse), as sketched after this list:

  • first pass: call standard_scaler.partial_fit() on all the original data chunks;
  • second pass: call standard_scaler.transform() on each chunk of the original data, then pass the result to the model.partial_fit method.
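
A hedged sketch of those two passes, again with iter_chunks() as a hypothetical chunk loader; this assumes a scaler that exposes partial_fit, which StandardScaler gained in later scikit-learn releases:

    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import SGDClassifier

    scaler = StandardScaler()
    model = SGDClassifier()

    # First pass: accumulate scaling statistics over every chunk.
    for X_chunk, _ in iter_chunks():
        scaler.partial_fit(X_chunk)

    # Second pass: transform each chunk with the now-fixed scaler
    # and feed it to the model (the class labels here are an assumption).
    for X_chunk, y_chunk in iter_chunks():
        model.partial_fit(scaler.transform(X_chunk), y_chunk, classes=[0, 1])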
answered Oct 18 '22 by ogrisel