When scaling the data, why does the train dataset use 'fit' and 'transform', but the test dataset only use 'transform'?

Tags:

scikit-learn

from random import seed, randint

import numpy as np
from sklearn.preprocessing import StandardScaler

SAMPLE_COUNT = 5000
TEST_COUNT = 20000

seed(0)
sample = list()
test_sample = list()

# Draw a training sample and a test sample from the file in a single pass.
# The file is opened in text mode so that split(',') works on str lines.
for index, line in enumerate(open('covtype.data', 'r')):
    if index < SAMPLE_COUNT:
        sample.append(line)
    else:
        r = randint(0, index)
        if r < SAMPLE_COUNT:
            sample[r] = line
        else:
            k = randint(0, index)
            if k < TEST_COUNT:
                if len(test_sample) < TEST_COUNT:
                    test_sample.append(line)
                else:
                    test_sample[k] = line

# Parse each line into floats; the last column is the label.
for n, line in enumerate(sample):
    sample[n] = list(map(float, line.strip().split(',')))
y = np.array(sample)[:, -1]

scaling = StandardScaler()
X = scaling.fit_transform(np.array(sample)[:, :-1])  # here use fit and transform

for n, line in enumerate(test_sample):
    test_sample[n] = list(map(float, line.strip().split(',')))
yt = np.array(test_sample)[:, -1]
Xt = scaling.transform(np.array(test_sample)[:, :-1])  # why here only use transform

As the comments in the code ask, why does Xt use only transform and not fit?

483

asked Apr 28 '17 08:04

2 Answers

We use fit_transform() on the train data so that we learn the scaling parameters from the train data and scale the train data at the same time. We only use transform() on the test data because we reuse the scaling parameters learned on the train data to scale the test data.

This is the standard procedure for scaling. You always learn your scaling parameters on the train data and then apply them to the test data. Here is an article that explains it very well: https://sebastianraschka.com/faq/docs/scale-training-test.html
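To make the two calls concrete, here is a minimal sketch on made-up toy data (the arrays and variable names are assumptions for illustration, not the asker's dataset). It shows that the scaler stores the train statistics in `mean_` and `scale_` after fitting, and that `transform` on the test data reuses them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: 4 training rows, 2 test rows, 2 features.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[2.5, 25.0], [10.0, 100.0]])

scaler = StandardScaler()

# fit_transform = fit (learn mean and std from X_train) + transform (apply them).
X_train_scaled = scaler.fit_transform(X_train)

# transform only: reuse the mean and std that were learned from X_train.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)    # per-feature mean of X_train
print(scaler.scale_)   # per-feature std of X_train
print(X_test_scaled)   # test rows expressed relative to the train statistics
```

The first test row equals the train mean, so it maps to zeros; the second row lies far outside the train range and keeps a large scaled value, which is exactly the information a model trained on X_train_scaled needs.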

174

answered Sep 29 '22 10:09

BenDes

We have two datasets: the training dataset and the test dataset. Imagine we have just 2 features: 'x1' and 'x2'.

Now consider this (a very hypothetical example):

A sample in the training data has the values 'x1' = 100 and 'x2' = 200. When scaled, 'x1' gets a value of 0.1 and 'x2' a value of 0.1 too. The response variable value for this sample is 100. The scaled values have been calculated with respect to only the training data's mean and std.

A sample in the test data has the values 'x1' = 50 and 'x2' = 100. If we scale it according to the test data's own mean and std, it may come out as 'x1' = 0.1 and 'x2' = 0.1 as well. Our function would then predict a response value of 100 for this sample too. But this is wrong. It should predict something else, because the unscaled feature values of the two samples are different and thus point to different response values. We get the correct prediction only when we scale the test sample with the training data's mean and std, because those are the values our linear regression function has learned from.
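The collision described above can be reproduced with a small sketch. The numbers are hypothetical and chosen so that every test row is exactly half of a training row; the scaled values are not the 0.1 of the example, but the effect is the same.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: every test row is exactly half of a training row.
X_train = np.array([[100.0, 200.0], [80.0, 160.0], [120.0, 240.0]])
X_test = X_train / 2  # first row is x1 = 50, x2 = 100

# Wrong: a second scaler fitted on the test data itself.
wrong = StandardScaler().fit_transform(X_test)

# Right: one scaler fitted on the training data and reused for the test data.
scaler = StandardScaler().fit(X_train)
train_scaled = scaler.transform(X_train)
right = scaler.transform(X_test)

# With its own statistics, the test data looks identical to the training data.
print(np.allclose(wrong, train_scaled))
# With the training statistics, the difference between the datasets survives.
print(np.allclose(right, train_scaled))
```

The first check prints True and the second prints False: scaling the test data by its own mean and std erases the fact that its raw values are half as large.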

I have tried to explain the intuition behind this logic below:

We decide to scale both features in the training dataset before fitting a linear regression function. When we scale the features of the training dataset, every 'x1' value is adjusted according to the mean and standard deviation of 'x1' across the training samples. The same thing happens for 'x2'. This essentially means that every feature value has been transformed into a new number based on just the training data. It's like every value has been given a relative position, relative to the mean and std of just the training data. So every sample's new 'x1' and 'x2' values depend on the mean and std of the training data only.
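As a rough illustration of that "relative position", the sketch below recomputes the scaling by hand as (x - mean) / std, using only the training statistics, and compares it with what StandardScaler returns. The data values are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data with two features, 'x1' and 'x2'.
X_train = np.array([[100.0, 200.0], [80.0, 260.0], [120.0, 140.0]])
X_test = np.array([[50.0, 100.0]])

scaler = StandardScaler().fit(X_train)

# Statistics of the training data only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population std (ddof=0), which StandardScaler uses

# Position of each value relative to the training mean, in units of training std.
manual_train = (X_train - mu) / sigma
manual_test = (X_test - mu) / sigma

print(np.allclose(manual_train, scaler.transform(X_train)))
print(np.allclose(manual_test, scaler.transform(X_test)))
```

Both checks print True, so a fitted scaler is just a stored pair of training mean and std that gets applied to whatever data is passed to transform.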

Now, when we fit the linear regression function, it learns its parameters (i.e., learns to predict the response values) from the scaled features of our training dataset. That means it is learning to predict based on those particular means and standard deviations of 'x1' and 'x2' in the training dataset. So the value of the predictions depends on the:

* learned parameters, which in turn depend on the

* values of the features of the training data (which have been scaled), and because of the scaling these depend on the

* training data's mean and std.

If we now fit StandardScaler() to the test data, the test data's 'x1' and 'x2' will get their own mean and std. The new values of both features will then be relative only to the test data and will have no connection whatsoever to the training data. It's almost as if an arbitrary value had been subtracted from them and they had been divided by another arbitrary value, so the new values no longer convey how they relate to the training data.
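To see the effect on predictions, here is a sketch with synthetic data (the linear relationship y = 3*x1 + 2*x2 and the value ranges are assumptions chosen for illustration). A linear regression is fitted on scaled training features and then evaluated on test data that was scaled once with the training scaler and once with a scaler fitted on the test data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data, assumed for illustration: y = 3*x1 + 2*x2 with no noise.
X_train = rng.uniform(50, 150, size=(200, 2))
y_train = 3 * X_train[:, 0] + 2 * X_train[:, 1]

# Test data from a different range, so its mean and std differ from training.
X_test = rng.uniform(0, 60, size=(50, 2))
y_test = 3 * X_test[:, 0] + 2 * X_test[:, 1]

# The model learns its parameters on features scaled with training statistics.
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

# Right: test features scaled with the training scaler.
pred_right = model.predict(scaler.transform(X_test))
# Wrong: test features scaled with a scaler fitted on the test data.
pred_wrong = model.predict(StandardScaler().fit_transform(X_test))

err_right = np.abs(pred_right - y_test).max()
err_wrong = np.abs(pred_wrong - y_test).max()
print(err_right, err_wrong)
```

In this setup the error with the training scaler is at floating-point noise level, while the error with the separately fitted scaler is large, because the model's parameters are only meaningful relative to the training mean and std.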

answered Sep 29 '22 10:09

aiish



Tags:

python

scikit-learn

littlely
