When scale the data, why the train dataset use 'fit' and 'transform', but the test dataset only use 'transform'?
SAMPLE_COUNT = 5000 TEST_COUNT = 20000 seed(0) sample = list() test_sample = list() for index, line in enumerate(open('covtype.data','rb')): if index < SAMPLE_COUNT: sample.append(line) else: r = randint(0,index) if r < SAMPLE_COUNT: sample[r] = line else: k = randint(0,index) if k < TEST_COUNT: if len(test_sample) < TEST_COUNT: test_sample.append(line) else: test_sample[k] = line from sklearn.preprocessing import StandardScaler for n, line in enumerate(sample): sample[n] = map(float, line.strip().split(',')) y = np.array(sample)[:,-1] scaling = StandardScaler() X = scaling.fit_transform(np.array(sample)[:,:-1]) ##here use fit and transform for n,line in enumerate(test_sample): test_sample[n] = map(float,line.strip().split(',')) yt = np.array(test_sample)[:,-1] Xt = scaling.transform(np.array(test_sample)[:,:-1])##why here only use transform
As the annotation says, why Xt only use transform but no fit?
fit_transform() is used on the training data so that we can scale the training data and also learn the scaling parameters of that data. Here, the model built by us will learn the mean and variance of the features of the training set. These learned parameters are then used to scale our test data.
By using similar data for training and testing, you can minimize the effects of data discrepancies and better understand the characteristics of the model. After a model has been processed by using the training set, you test the model by making predictions against the test set.
The fit(data) method is used to compute the mean and std dev for a given feature to be used further for scaling. The transform(data) method is used to perform scaling using mean and std dev calculated using the . fit() method.
You aren't going to be able to get “another test set” easily, so you want the test set that you have to be used once so that it provides the best possible estimate of the model's generalization ability.
Show activity on this post. We use fit_transform () on the train data so that we learn the parameters of scaling on the train data and in the same time we scale the train data. We only use transform () on the test data because we use the scaling paramaters learned on the train data to scale the test data.
Including the test dataset in the transform computation will allow information to flow from the test data to the train data and therefore to the model that learns from it, thus allowing the model to cheat (introducing a bias). Also, it is important not to confuse transformations with augmentations.
Therefore, train and test datasets are the two key concepts of machine learning, where the training dataset is used to fit the model, and the test dataset is used to evaluate the model. In this topic, we are going to discuss train and test datasets along with the difference between both of them.
We decide to scale both the features in the training dataset before applying linear regression and fitting the linear regression function. When we scale the features of the training dataset, all 'x1' features get adjusted according to the mean and the standard deviations of the different samples w. r. t to their 'x1' feature values.
We use fit_transform()
on the train data so that we learn the parameters of scaling on the train data and in the same time we scale the train data. We only use transform()
on the test data because we use the scaling paramaters learned on the train data to scale the test data.
This is the standart procedure to scale. You always learn your scaling parameters on the train and then use them on the test. Here is an article that explane it very well : https://sebastianraschka.com/faq/docs/scale-training-test.html
We have two datasets : The training and the test dataset. Imagine we have just 2 features :
'x1' and 'x2'.
Now consider this (A very hypothetical example):
A sample in the training data has values: 'x1' = 100 and 'x2' = 200 When scaled, 'x1' gets a value of 0.1 and 'x2' a value of 0.1 too. The response variable value is 100 for this. These have been calculated w.r.t only the training data's mean and std.
A sample in the test data has the values : 'x1' = 50 and 'x2' = 100. When scaled according to the test data values, 'x1' = 0.1 and 'x2' = 0.1. This means that our function will predict response variable value of 100 for this sample too. But this is wrong. It shouldn't be 100. It should be predicting something else because the not-scaled values of the features of the 2 samples mentioned above are different and thus point to different response values. We will know what the correct prediction is only when we scale it according to the training data because those are the values that our linear regression function has learned.
I have tried to explain the intuition behind this logic below:
We decide to scale both the features in the training dataset before applying linear regression and fitting the linear regression function. When we scale the features of the training dataset, all 'x1' features get adjusted according to the mean and the standard deviations of the different samples w.r.t to their 'x1' feature values. Same thing happens for 'x2' feature. This essentially means that every feature has been transformed into a new number based on just the training data. It's like Every feature has been given a relative position. Relative to the mean and std of just the training data. So every sample's new 'x1' and 'x2' values are dependent on the mean and the std of the training data only.
Now what happens when we fit the linear regression function is that it learns the parameters (i.e, learns to predict the response values) based on the scaled features of our training dataset. That means that it is learning to predict based on those particular means and standard deviations of 'x1' and 'x2' of the different samples in the training dataset. So the value of the predictions depends on the:
*learned parameters. Which in turn depend on the
*value of the features of the training data (which have been scaled).And because of the scaling the training data's features depend on the
*training data's mean and std.
If we now fit the standardscaler() to the test data, the test data's 'x1' and 'x2' will have their own mean and std. This means that the new values of both the features will in turn be relative to only the data in the test data and thus will have no connection whatsoever to the training data. It's almost like they have been subtracted by and divided by random values and have got new values now which do not convey how they are related to the training data.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With