I am having trouble understanding how exactly <code>transform()</code> and <code>fit_transform()</code> work together. I call <code>fit_transform()</code> on my training set and then <code>transform()</code> on my test set. However, if I call <code>fit_transform()</code> on the test set instead, I get bad results. Can anybody explain how and why this happens?

fit-transform on training data and transform on test data [duplicate]

1 Answer

Let's take an example of a transform, sklearn.preprocessing.StandardScaler.

From the docs, this will:

Standardize features by removing the mean and scaling to unit variance

Suppose you're working with code like the following.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

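# X is features, y is label (placeholder data here so the snippet runs)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10) % 2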

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

When you call StandardScaler().fit(X_train), it calculates the mean and standard deviation of each feature from the values in X_train. Calling .transform() on the fitted scaler then standardizes the features by subtracting that mean and dividing by that standard deviation. For convenience, these two calls can be done in one step on the same data using fit_transform().
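As a minimal sketch of the point above, the following compares the two-step call, the one-step call, and the arithmetic done by hand. The small X_train array is made up for illustration; mean_ and scale_ are the attributes a fitted StandardScaler exposes for the per-feature mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: 4 samples, 2 features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

# Two steps: learn the statistics, then apply them
scaler = StandardScaler()
scaler.fit(X_train)
two_step = scaler.transform(X_train)

# One step: fit_transform() does both on the same data
one_step = StandardScaler().fit_transform(X_train)

# Manual equivalent: subtract the mean, divide by the standard deviation
manual = (X_train - scaler.mean_) / scaler.scale_

same_result = np.allclose(two_step, one_step) and np.allclose(two_step, manual)
print(same_result)
```

All three arrays should agree, which suggests fit_transform() is only a shortcut and not a different computation.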

You want to fit the scaler using only the training data because you don't want to bias your model with information from the test data.

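If you fit() to your test data, you'd compute a new mean and variance for each feature. These values may be very similar if your test and train sets have the same distribution, but with finite samples they will rarely match exactly. The same raw value would then be scaled differently in the two sets, so the model would see test inputs on a different scale from the one it was trained on, which is why the results get worse.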

Instead, you want to only transform the test data by using the parameters computed on the training data.
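To make the difference concrete, here is a small sketch with made-up one-feature data in which the test values sit higher than the training values. It scales the test set once with the training statistics and once by refitting on the test set, then looks at what happens to the raw value 5.0, which appears in both sets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one feature, test values shifted upward
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test = np.array([[4.0], [5.0], [6.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from train
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

# The mistake: refitting a new scaler on the test set
X_test_refit = StandardScaler().fit_transform(X_test)

# Scaled versions of the raw value 5.0
train_five = X_train_scaled[4, 0]
test_five = X_test_scaled[1, 0]
refit_five = X_test_refit[1, 0]

print(train_five, test_five, refit_five)
```

With the training statistics, 5.0 maps to the same scaled value in both sets. After refitting on the test set, 5.0 is the test mean and maps to 0, so a model trained on the first scaling would treat it as a different input.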

answered Oct 26 '22 06:10

pault


Tags:

python

scikit-learn

b4shyou
