
Fitting data vs. transforming data in scikit-learn

In scikit-learn, all estimators have a fit() method, and depending on whether they are supervised or unsupervised, they also have a predict() or transform() method.

I am writing a transformer for an unsupervised learning task and was wondering whether there is a rule of thumb for which kind of learning logic belongs in which method. The official documentation is not very helpful in this regard:

fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.

In this context, what is meant by both fitting data and transforming data?

zepp133 asked Jul 22 '15 19:07




1 Answer

Fitting finds the internal parameters of a model that will be used to transform data. Transforming applies those parameters to data. You may fit a model to one set of data and then use it to transform a completely different set.
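As a minimal sketch of that split, the snippet below fits a StandardScaler on one array and then transforms a different one. The choice of StandardScaler and the toy arrays X_train and X_new are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (assumed for illustration): one feature, two separate sets.
X_train = np.array([[1.0], [2.0], [3.0]])
X_new = np.array([[4.0], [5.0]])

scaler = StandardScaler()

# fit() only learns parameters: the per-feature mean and scale of X_train.
scaler.fit(X_train)
print(scaler.mean_, scaler.scale_)

# transform() applies the stored parameters to any data, here a different set.
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)
```

Note that X_new is scaled with the mean and scale learned from X_train, not with its own statistics.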

For example, you fit a linear model to data to get a slope and intercept. Then you use those parameters to transform (i.e., map) new or existing values of x to y.
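The linear-model case might look like the sketch below. It assumes LinearRegression and made-up points lying exactly on y = 2x + 1. For supervised estimators, scikit-learn names the mapping step predict() rather than transform(), but the division of labour is the same.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (assumed for illustration): points on the line y = 2x + 1.
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * x.ravel() + 1.0

model = LinearRegression()

# fit() estimates the slope and intercept and stores them on the model.
model.fit(x, y)
slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)

# predict() uses the stored slope and intercept to map a new x to y.
y_new = model.predict(np.array([[10.0]]))
print(y_new)
```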

fit_transform is just doing both steps to the same data.

A scikit-learn example: you fit PCA to the data to find the principal components. Then you transform your data to see how it maps onto these components:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=2)
    X = [[1,2],[2,4],[1,3]]
    pca.fit(X)

    # This is the model to map data
    pca.components_
    array([[ 0.47185791,  0.88167459],
           [-0.88167459,  0.47185791]], dtype=float32)

    # Now we actually map the data
    pca.transform(X)
    array([[-1.03896057, -0.17796634],
           [ 1.19624651, -0.11592512],
           [-0.15728599,  0.29389156]])

    # Or we can do both "at once"
    pca.fit_transform(X)
    array([[-1.03896058, -0.1779664 ],
           [ 1.19624662, -0.11592512],
           [-0.15728603,  0.29389152]], dtype=float32)
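
For a custom transformer, the same split suggests a rule of thumb: anything learned from the data goes in fit(), and transform() only applies what was learned. The sketch below is one possible layout, not a definitive implementation. MeanCenterer is a hypothetical class invented for this example; BaseEstimator and TransformerMixin are scikit-learn base classes, and TransformerMixin supplies fit_transform() on top of the two methods you write.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that subtracts the per-feature mean."""

    def fit(self, X, y=None):
        # Learning logic lives here: compute and store the parameters.
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        return self  # returning self allows fit(X).transform(X) chaining

    def transform(self, X):
        # No learning here: only apply the parameters stored by fit().
        X = np.asarray(X, dtype=float)
        return X - self.mean_


X = [[1, 2], [2, 4], [1, 3]]

centerer = MeanCenterer()
two_step = centerer.fit(X).transform(X)

# fit_transform() is inherited from TransformerMixin.
one_step = MeanCenterer().fit_transform(X)

# The fitted object can transform data it was never fitted on.
shifted = centerer.transform([[0, 0]])
print(two_step, one_step, shifted, sep="\n")
```

Storing learned values in attributes with a trailing underscore (mean_ here) follows the scikit-learn naming convention for fitted parameters.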
inversion answered Oct 03 '22 01:10
