from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
What I know is that the fit() method calculates the mean and standard deviation of each feature, and the transform() method then uses them to turn the feature into a new, scaled feature. fit_transform() is nothing but fit() followed by transform() in a single call.
But why are we calling fit() only on the training data and not on the testing data?
Does that mean we are using the mean and standard deviation of the training data to transform our testing data?
fit computes the mean and stdev to be used for later scaling. Note that it is just a computation; no scaling is done.
transform uses the previously computed mean and stdev to scale the data (subtract the mean from every value, then divide by the stdev).
fit_transform does both in one step, so you can do it with just 1 line of code.
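To make the three methods concrete, here is a minimal pure-Python sketch of the same idea for a single feature. TinyScaler, its std_ attribute, and the sample numbers are made up for illustration and are not part of scikit-learn; the sketch uses the population stdev (divide by n), which is also what StandardScaler uses.

```python
from statistics import fmean, pstdev


class TinyScaler:
    """Hypothetical single-feature scaler; not part of scikit-learn."""

    def fit(self, values):
        # Only computes and stores the statistics; nothing is scaled here.
        self.mean_ = fmean(values)
        self.std_ = pstdev(values)  # population stdev (divide by n)
        return self

    def transform(self, values):
        # Reuses the stored statistics; never recomputes them.
        return [(v - self.mean_) / self.std_ for v in values]

    def fit_transform(self, values):
        # Same as calling fit() and then transform() on the same data.
        return self.fit(values).transform(values)


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [3.0, 6.0]

scaler = TinyScaler()
train_scaled = scaler.fit_transform(train)  # statistics come from train
test_scaled = scaler.transform(test)        # the same statistics are reused on test
```

After fit_transform(train), the stored mean is 3.0, so the test value 3.0 is scaled to 0.0 even though the test list on its own has a different mean.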
For the X_train dataset, we call fit_transform because we need to compute the mean and stdev and then use them to scale X_train. For the X_test dataset, we already have the mean and stdev, so we only do the transformation part.
Edit: X_test should be treated as totally unseen and unknown (i.e., no info is extracted from it), so we can only derive info from X_train. We apply the mean and stdev derived from X_train to X_test as well so that the model sees test features on the same scale it was trained on, which gives an "apple-to-apple" comparison between y_test and y_pred.
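A small sketch of that point with scikit-learn itself. The arrays are made-up illustration data, and the "wrong" variant is there only to show what goes off when a second scaler is fitted on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: one feature, five training rows, two test rows.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test = np.array([[3.0], [6.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mean_ and scale_ from X_train
X_test_scaled = sc.transform(X_test)        # applies those same values to X_test

# Scaling the test set by hand with the training statistics gives the same result.
manual = (X_test - sc.mean_) / sc.scale_
print(np.allclose(manual, X_test_scaled))

# For contrast only: fitting a separate scaler on the test set.
wrong = StandardScaler().fit_transform(X_test)

# The raw value 3.0 appears in both sets. With the training statistics it is
# scaled identically in both; with the separately fitted scaler it is not.
print(X_train_scaled[2, 0], X_test_scaled[0, 0], wrong[0, 0])
```

With the training statistics, the raw value 3.0 lands on the same scaled value in both sets. The separately fitted scaler maps it somewhere else, so the model would be fed inputs on a scale it never saw during training.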
By the way, if the train/test split is done properly without bias and the data is sufficiently large, both datasets will have roughly the same mean and stdev, since each approximates the population mean and stdev.
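A quick way to check this claim on synthetic data, using only the standard library. The distribution (mean 10, stdev 2), the sample size, and the 80/20 split are arbitrary choices for illustration.

```python
import random
from statistics import fmean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

# Synthetic feature: 100,000 draws from a normal distribution (mean 10, stdev 2).
data = [random.gauss(10.0, 2.0) for _ in range(100_000)]

# The draws are already in random order, so slicing gives an unbiased 80/20 split.
train, test = data[:80_000], data[80_000:]

print(f"train: mean={fmean(train):.3f} stdev={pstdev(train):.3f}")
print(f"test:  mean={fmean(test):.3f} stdev={pstdev(test):.3f}")
```

The two lines should print nearly identical values. With a small or biased split, the gap can be much larger, which is one more reason to rely only on the training statistics.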