I am trying to use a deep learning model for time series prediction, and before passing the data to the model I want to scale the different variables as they have widely different ranges.
I have normally done this "on the fly": load the training subset of the data set, fit the scaler on that whole subset, store it, and then load it when I want to use it for testing.
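A minimal sketch of that workflow (assuming joblib for persistence; X_train, X_test, and the file name are just placeholders):

from sklearn.preprocessing import StandardScaler
import joblib

# X_train / X_test stand in for the loaded training and test arrays
# Fit the scaler on the whole training subset, which fits in memory here
scaler = StandardScaler()
scaler.fit(X_train)
joblib.dump(scaler, "train_scaler.joblib")

# Later, at test time, reload the same scaler and apply it
scaler = joblib.load("train_scaler.joblib")
X_test_scaled = scaler.transform(X_test)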
Now the data is pretty big and I will not load all the training data at once for training.
How should I go about obtaining the scaler? A priori, I thought of doing a one-time operation of loading all the data just to fit the scaler (I normally use the sklearn scalers, like StandardScaler), saving it, and then loading it during my training process.
Is this common practice? If it is, what would you do if new data is added to the training dataset? Can scalers be combined to avoid that one-time operation and just "update" the scaler?
StandardScaler in scikit-learn is able to calculate the mean and std of the data in an incremental fashion (on small chunks of data) by using partial_fit():
partial_fit(X, y=None)
Online computation of mean and std on X for later scaling. All of X is processed as a single batch. This is intended for cases when fit is not feasible due to very large number of n_samples or because X is read from a continuous stream.
You will need two passes over the data: one with partial_fit() to calculate the mean and std, and a second with transform() to scale it on the fly. Example:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# First pass
# some_generator can be anything which reads the data in batches;
# note that a plain generator is exhausted after one pass, so it must be
# re-created (or be a re-iterable object) for the second pass
for data in some_generator:
    scaler.partial_fit(data)

    # View the updated mean and variance after each batch
    print(scaler.mean_)
    print(scaler.var_)

# Second pass
for data in some_generator:
    scaled_data = scaler.transform(data)
    # Do whatever you want with the scaled_data
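If you later add data to the training set, you do not need to redo the one-time pass from scratch: partial_fit() keeps updating the running mean and variance of an already-fitted scaler. A minimal sketch, assuming joblib for persistence (the file name and new_data_generator are hypothetical):

import joblib

# Persist the fitted scaler so later runs can reuse it
joblib.dump(scaler, "scaler.joblib")  # file name is just an example

# When new training data arrives, reload the scaler and keep updating it
scaler = joblib.load("scaler.joblib")
for data in new_data_generator:  # hypothetical generator over the new batches
    scaler.partial_fit(data)
joblib.dump(scaler, "scaler.joblib")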