In scikit-learn, all estimators have a fit() method, and depending on whether they are supervised or unsupervised, they also have a predict() or transform() method.
I am in the process of writing a transformer for an unsupervised learning task and was wondering if there is a rule of thumb for which kind of learning logic belongs in which method. The official documentation is not very helpful in this regard:
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
In this context, what is meant by both fitting data and transforming data?
In the case of a scaler such as StandardScaler, the fit method calculates the mean and variance of each feature present in the data, and the transform method then standardizes every feature using the stored mean and variance.
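A minimal sketch of that split, using StandardScaler (the data values here are just made up for illustration): fit() only records the per-feature statistics, and transform() applies them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaler = StandardScaler()
scaler.fit(X)                   # learns per-feature mean and variance
print(scaler.mean_)             # per-feature means, stored as internal state
print(scaler.var_)              # per-feature variances

X_scaled = scaler.transform(X)  # applies (x - mean) / std to each feature
print(X_scaled)
```

After transforming, each column has zero mean and unit standard deviation, which is exactly what the stored statistics are for.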
The scikit-learn 'fit' method trains the algorithm on the training data after the model is initialized. That's really all it does: fit takes the training data as input and learns the model's parameters from it.
fit() - It calculates the parameters/weights from the training data (e.g. the coefficients exposed by the coef_ attribute in the case of LinearRegression) and saves them as the estimator's internal state. predict() - Uses those learned weights on test data to make predictions. transform() - Not available on predictors such as LinearRegression. fit_transform() - Likewise not available; both exist only on transformers.
To put it simply: use the fit_transform() method on the training set, since you need to both fit and transform it, and then call transform() on the test data using the parameters learned from training. Equivalently, you can call fit() on the training set first and transform() each dataset afterwards.
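Here is a minimal sketch of that train/test pattern with StandardScaler (the values are invented): the test set is scaled with the statistics learned from the training set, never with its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit and transform in one call
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(scaler.mean_)       # mean learned from X_train only
print(X_test_scaled)      # (4 - train_mean) / train_std
```

Calling fit_transform() on the test set instead would silently leak test-set statistics into the preprocessing, which is the mistake this pattern avoids.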
Fitting finds the internal parameters of a model that will be used to transform data. Transforming applies those parameters to data. You may fit a model to one set of data and then use it to transform a completely different set.
For example, you fit a linear model to data to get a slope and intercept. Then you use those parameters to transform (i.e., map) new or existing values of x to y.
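That slope-and-intercept picture can be sketched directly in NumPy, without scikit-learn at all (the data here is invented for illustration): np.polyfit plays the role of "fit", and applying the resulting parameters plays the role of "transform".

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3.0 * x + 1.0                        # points lying exactly on y = 3x + 1

# "Fit": estimate slope and intercept from the data
slope, intercept = np.polyfit(x, y, deg=1)

# "Transform": map a brand-new x value to y with the fitted parameters
x_new = np.array([10.0])
y_new = slope * x_new + intercept
print(slope, intercept, y_new)
```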
fit_transform just does both steps on the same data.
A scikit example: You fit data to find the principal components. Then you transform your data to see how it maps onto these components:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X = [[1, 2], [2, 4], [1, 3]]

pca.fit(X)  # This is the model to map data
pca.components_
array([[ 0.47185791,  0.88167459],
       [-0.88167459,  0.47185791]], dtype=float32)

# Now we actually map the data
pca.transform(X)
array([[-1.03896057, -0.17796634],
       [ 1.19624651, -0.11592512],
       [-0.15728599,  0.29389156]])

# Or we can do both "at once"
pca.fit_transform(X)
array([[-1.03896058, -0.1779664 ],
       [ 1.19624662, -0.11592512],
       [-0.15728603,  0.29389152]], dtype=float32)