In scikit-learn, all estimators have a fit() method, and depending on whether they are supervised or unsupervised, they also have a predict() or transform() method.
I am in the process of writing a transformer for an unsupervised learning task and was wondering if there is a rule of thumb for which kind of learning logic belongs in which method. The official documentation is not very helpful in this regard:
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
In this context, what is meant by both fitting data and transforming data?
In the case of a scaler such as StandardScaler, the fit method calculates the mean and variance of each feature present in the data, and the transform method then standardizes every feature using the stored mean and variance.
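A minimal sketch of that split, using StandardScaler (the data values here are just made up for illustration): fit() only records the per-feature statistics, and transform() applies them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaler = StandardScaler()
scaler.fit(X)                   # learns per-feature mean and variance
print(scaler.mean_)             # per-feature means, stored as internal state
print(scaler.var_)              # per-feature variances

X_scaled = scaler.transform(X)  # applies (x - mean) / std to each feature
print(X_scaled)
```

After transforming, each column has zero mean and unit standard deviation, which is exactly what the stored statistics are for.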
The scikit-learn 'fit' method trains the algorithm on the training data after the model is initialized. That's really all it does: fit takes the training data as input and learns the model's parameters from it.
fit() - It calculates the parameters/weights from the training data (e.g. the coefficients exposed by the coef_ attribute in the case of LinearRegression) and saves them as the estimator's internal state. predict() - Uses those learned weights on test data to make predictions. transform() - Not available on predictors such as LinearRegression. fit_transform() - Likewise not available; both exist only on transformers.
To put it simply: use the fit_transform() method on the training set, since you need to both fit and transform it, and then call transform() on the test data using the parameters learned from training. Equivalently, you can call fit() on the training set first and transform() each dataset afterwards.
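Here is a minimal sketch of that train/test pattern with StandardScaler (the values are invented): the test set is scaled with the statistics learned from the training set, never with its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit and transform in one call
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(scaler.mean_)       # mean learned from X_train only
print(X_test_scaled)      # (4 - train_mean) / train_std
```

Calling fit_transform() on the test set instead would silently leak test-set statistics into the preprocessing, which is the mistake this pattern avoids.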
Fitting finds the internal parameters of a model that will be used to transform data. Transforming applies those parameters to data. You may fit a model to one set of data and then use it to transform a completely different set.
For example, you fit a linear model to data to get a slope and intercept. Then you use those parameters to transform (i.e., map) new or existing values of x to y.
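That slope-and-intercept picture can be sketched directly in NumPy, without scikit-learn at all (the data here is invented for illustration): np.polyfit plays the role of "fit", and applying the resulting parameters plays the role of "transform".

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3.0 * x + 1.0                        # points lying exactly on y = 3x + 1

# "Fit": estimate slope and intercept from the data
slope, intercept = np.polyfit(x, y, deg=1)

# "Transform": map a brand-new x value to y with the fitted parameters
x_new = np.array([10.0])
y_new = slope * x_new + intercept
print(slope, intercept, y_new)
```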
fit_transform just does both steps on the same data.
A scikit example: You fit data to find the principal components. Then you transform your data to see how it maps onto these components:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X = [[1, 2], [2, 4], [1, 3]]

pca.fit(X)  # This is the model to map data
pca.components_
array([[ 0.47185791,  0.88167459],
       [-0.88167459,  0.47185791]], dtype=float32)

# Now we actually map the data
pca.transform(X)
array([[-1.03896057, -0.17796634],
       [ 1.19624651, -0.11592512],
       [-0.15728599,  0.29389156]])

# Or we can do both "at once"
pca.fit_transform(X)
array([[-1.03896058, -0.1779664 ],
       [ 1.19624662, -0.11592512],
       [-0.15728603,  0.29389152]], dtype=float32)