My dataset is getting too large and I'm looking for online learning solutions in sklearn, which they refer to as out-of-core learning.
They offer some classes that expose a partial_fit API, which basically lets you keep only a subset of your data in memory and operate on it. However, a lot of preprocessing stages (such as data scaling) learn parameters during their fit stage on the training data, and those parameters are then used for transformations.
For example, if you use a min-max scaler to bound features to [-1, 1] or standardize your data, the parameters they learn and eventually use to transform data are learned from a subset of the training data they happen to be operating on in a given iteration.
This means it's possible that the parameters learned during the fit stage on one subset of training data could be different from another subset of training data, since they are training set specific. And there lies the heart of my question:
How can you combine parameters learned during the fit stage of a preprocessing step when using online/out-of-core learning, when the learned parameters are a function of the training data?
You can fit the StandardScaler instance on a large enough subset that fits in RAM at once (say a couple of GB of data) and then re-use the same fixed instance of the scaler to transform the rest of the data one batch at a time. You should be able to get a good estimate of the mean and std values of each feature from a few thousand samples, so there is no need to compute the actual fit on the full data just for the scaler.
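As a rough sketch of that approach: the synthetic array, the subset size, and the batch size below are all made-up stand-ins for your real data and memory budget, not recommendations.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a dataset that would normally live on disk.
rng = np.random.RandomState(0)
data = rng.normal(loc=3.0, scale=2.0, size=(10_000, 4))

# Fit once on a subset that fits comfortably in memory.
subset = data[:2_000]
scaler = StandardScaler().fit(subset)

# Reuse the same fitted scaler, unchanged, for every batch.
batch_size = 1_000
scaled_batches = [
    scaler.transform(data[start:start + batch_size])
    for start in range(0, len(data), batch_size)
]
```

Because the scaler is never refit, every batch is transformed with the same mean and std, which is what keeps the batches comparable to each other.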
It would still be nice to add a partial_fit method to the StandardScaler class, implementing streaming mean & variance estimation for completeness.
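To illustrate what streaming mean & variance estimation could look like, here is a minimal pure-Python sketch based on Welford's update. The StreamingScaler class and its method names are hypothetical and are not scikit-learn's implementation; it uses the population standard deviation (ddof=0), which matches what StandardScaler computes.

```python
from math import sqrt


class StreamingScaler:
    """Hypothetical helper: per-feature running mean and variance."""

    def __init__(self, n_features):
        self.count = 0
        self.mean = [0.0] * n_features
        self.m2 = [0.0] * n_features  # running sum of squared deviations

    def partial_fit(self, rows):
        """Update the statistics with one chunk, one row at a time."""
        for row in rows:
            self.count += 1
            for j, x in enumerate(row):
                delta = x - self.mean[j]
                self.mean[j] += delta / self.count
                self.m2[j] += delta * (x - self.mean[j])
        return self

    def std(self):
        """Population standard deviation (ddof=0) of each feature."""
        return [sqrt(m2 / self.count) for m2 in self.m2]

    def transform(self, rows):
        """Center and scale rows; constant features are only centered."""
        std = [s if s > 0 else 1.0 for s in self.std()]
        return [
            [(x - m) / s for x, m, s in zip(row, self.mean, std)]
            for row in rows
        ]


# Two chunks that are never held in memory together in a real setting.
chunks = [
    [[1.0, 10.0], [2.0, 20.0]],
    [[3.0, 30.0], [4.0, 40.0]],
]
scaler = StreamingScaler(n_features=2)
for chunk in chunks:
    scaler.partial_fit(chunk)
```

The point of the sketch is that the statistics after the last chunk are the same as if the whole dataset had been seen at once, so the order and size of the chunks do not matter for the final parameters.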
But even if StandardScaler had a partial_fit method, you would still need to make several passes over the data (and optionally store the preprocessed data on the drive for later reuse):
First pass: call standard_scaler.partial_fit() on all the original data chunks.
Second pass: call standard_scaler.transform on each chunk of the original data, then pass the result to the model.partial_fit method.
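A minimal sketch of those two passes follows. Recent scikit-learn releases do provide StandardScaler.partial_fit, so this assumes such a version is installed. The iter_chunks generator is a hypothetical stand-in for however you read batches from disk, and SGDClassifier is just one example of an estimator that supports partial_fit.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler


def iter_chunks(n_chunks=5, chunk_size=200, seed=0):
    """Hypothetical chunk source; yields the same chunks on every call."""
    rng = np.random.RandomState(seed)
    for _ in range(n_chunks):
        # Two features on very different scales.
        X = rng.normal(loc=[5.0, -3.0], scale=[10.0, 0.1], size=(chunk_size, 2))
        y = (X[:, 0] > 5.0).astype(int)
        yield X, y


# First pass: accumulate the scaling statistics over all chunks.
standard_scaler = StandardScaler()
for X, _ in iter_chunks():
    standard_scaler.partial_fit(X)

# Second pass: transform each chunk with the final statistics and
# feed it to the model incrementally.
model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all labels must be declared up front
for X, y in iter_chunks():
    model.partial_fit(standard_scaler.transform(X), y, classes=classes)
```

Running the scaler to completion before the model sees any data means every chunk is transformed with the same parameters. Updating the scaler and the model in a single interleaved pass would scale early chunks differently from later ones.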