
When scaling the data, why does the train dataset use 'fit' and 'transform', but the test dataset only use 'transform'?


from random import seed, randint
import numpy as np
from sklearn.preprocessing import StandardScaler

SAMPLE_COUNT = 5000
TEST_COUNT = 20000
seed(0)

sample = []
test_sample = []
# reservoir-sample lines from the file into a train pool and a test pool
for index, line in enumerate(open('covtype.data')):
    if index < SAMPLE_COUNT:
        sample.append(line)
    else:
        r = randint(0, index)
        if r < SAMPLE_COUNT:
            sample[r] = line
        else:
            k = randint(0, index)
            if k < TEST_COUNT:
                if len(test_sample) < TEST_COUNT:
                    test_sample.append(line)
                else:
                    test_sample[k] = line

for n, line in enumerate(sample):
    sample[n] = list(map(float, line.strip().split(',')))
y = np.array(sample)[:, -1]

scaling = StandardScaler()
X = scaling.fit_transform(np.array(sample)[:, :-1])   # here: fit and transform

for n, line in enumerate(test_sample):
    test_sample[n] = list(map(float, line.strip().split(',')))
yt = np.array(test_sample)[:, -1]

Xt = scaling.transform(np.array(test_sample)[:, :-1])  # why only transform here?

As the comments in the code ask: why does Xt use only transform and not fit?

littlely asked Apr 28 '17

People also ask

Why do we use fit_transform on the train set but just transform on the test set?

fit_transform() is used on the training data so that we can scale the training data and also learn the scaling parameters of that data. Here, the scaler we build will learn the mean and variance of the features of the training set. These learned parameters are then used to scale our test data.
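For illustration, here is a minimal sketch (the numbers are made up) showing where those learned parameters live on the StandardScaler object:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., 10.], [2., 20.], [3., 30.]])
X_test = np.array([[2., 25.]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/variance AND scales

print(scaler.mean_)  # [ 2. 20.], computed from the training data only
print(scaler.var_)   # per-feature variance, also from the training data only

X_test_scaled = scaler.transform(X_test)        # reuses the learned parameters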

Why should we use different data to train and to test a model?

By using similar data for training and testing, you can minimize the effects of data discrepancies and better understand the characteristics of the model. After a model has been processed by using the training set, you test the model by making predictions against the test set.

Why do we use fit and transform?

The fit(data) method is used to compute the mean and std dev of a given feature, to be used further for scaling. The transform(data) method is used to perform scaling using the mean and std dev calculated by the .fit() method.
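In other words, .transform() just applies the formula (x - mean) / std with the statistics stored by .fit(). A quick sketch with toy data to verify this:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.], [2.], [3.]])
scaler = StandardScaler().fit(X)  # .fit() stores mean_ and scale_ (the std dev)

manual = (X - scaler.mean_) / scaler.scale_
assert np.allclose(manual, scaler.transform(X))  # .transform() applies that formula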

Why should the test set be used only once?

You aren't going to be able to get “another test set” easily, so you want the test set that you have to be used once so that it provides the best possible estimate of the model's generalization ability.


Why is it important to include the test dataset in transform?

Including the test dataset when computing the transformation would allow information to flow from the test data into the train data, and therefore into the model that learns from it, allowing the model to cheat (introducing a bias). Also, it is important not to confuse transformations with augmentations.
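One common way to enforce this in practice is a scikit-learn Pipeline: the scaler is then refit on each training fold only, so no statistics from held-out data can leak into the model. A minimal sketch (using the built-in iris dataset and a logistic regression purely as an example estimator):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The pipeline refits StandardScaler on each training fold only,
# so the held-out fold's statistics never reach the model
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X, y, cv=5))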

What is the difference between train and test datasets?

Train and test datasets are two key concepts of machine learning: the training dataset is used to fit the model, and the test dataset is used to evaluate the model.



2 Answers

We use fit_transform() on the train data so that we learn the parameters of scaling on the train data and, at the same time, scale the train data. We only use transform() on the test data because we use the scaling parameters learned on the train data to scale the test data.

This is the standard procedure for scaling: you always learn your scaling parameters on the train set and then use them on the test set. Here is an article that explains it very well: https://sebastianraschka.com/faq/docs/scale-training-test.html
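Put together, the usual pattern looks like this (sketched here with scikit-learn's built-in wine dataset as a stand-in for real data):

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn the parameters on train, scale train
X_test = scaler.transform(X_test)        # reuse those parameters to scale test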

BenDes answered Sep 29 '22


We have two datasets: the training dataset and the test dataset. Imagine we have just two features, 'x1' and 'x2'.

Now consider this (a very hypothetical example):

A sample in the training data has the values 'x1' = 100 and 'x2' = 200. When scaled, 'x1' gets a value of 0.1 and 'x2' a value of 0.1 too. The response variable value is 100 for this sample. These scaled values have been calculated w.r.t. only the training data's mean and std.

A sample in the test data has the values 'x1' = 50 and 'x2' = 100. When scaled according to the test data's own statistics, 'x1' = 0.1 and 'x2' = 0.1. This means our function will predict a response variable value of 100 for this sample too. But that is wrong. It shouldn't be 100; it should be predicting something else, because the unscaled feature values of the two samples above are different and thus point to different response values. We will only get the correct prediction when we scale the test sample according to the training data, because those are the values our linear regression function has learned from.
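Here is that hypothetical in code (the numbers are invented just to show the effect):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[100., 200.],
                    [ 60., 120.],
                    [ 20.,  40.]])
X_test = np.array([[50., 100.],
                   [10.,  20.]])

scaler = StandardScaler().fit(X_train)

# Right: the test data is scaled with the TRAINING mean/std
print(scaler.transform(X_test))

# Wrong: the test data is scaled with its OWN mean/std; the same raw
# value now maps to a different number than it would for the model
print(StandardScaler().fit_transform(X_test))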

I have tried to explain the intuition behind this logic below:

We decide to scale both features in the training dataset before applying linear regression and fitting the linear regression function. When we scale the features of the training dataset, all 'x1' features get adjusted according to the mean and standard deviation of the different samples w.r.t. their 'x1' feature values. The same thing happens for the 'x2' feature. This essentially means that every feature has been transformed into a new number based on the training data alone. It's as if every feature has been given a relative position, relative to the mean and std of just the training data. So every sample's new 'x1' and 'x2' values depend on the mean and the std of the training data only.

Now, what happens when we fit the linear regression function is that it learns the parameters (i.e., learns to predict the response values) based on the scaled features of our training dataset. That means it is learning to predict based on those particular means and standard deviations of 'x1' and 'x2' across the samples in the training dataset. So the value of the predictions depends on the:

* learned parameters, which in turn depend on the
* values of the features of the training data (which have been scaled), and because of the scaling those depend on the
* training data's mean and std.

If we now fit StandardScaler() to the test data, the test data's 'x1' and 'x2' will have their own mean and std. This means the new values of both features will be relative only to the data in the test set and thus have no connection whatsoever to the training data. It's almost as if they have been subtracted by and divided by random values, producing new numbers that no longer convey how they relate to the training data.

aiish answered Sep 29 '22