
Why does sklearn Imputer need to fit?

I'm really new to this whole machine learning thing and I'm taking an online course on the subject. In this course, the instructors showed the following piece of code:

    from sklearn.preprocessing import Imputer

    imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
    imputer = imputer.fit(X[:, 1:3])
    X[:, 1:3] = imputer.transform(X[:, 1:3])

I don't really get why this imputer object needs to fit. I mean, I'm just trying to get rid of missing values in my columns by replacing them with the column mean. From the little I know about programming, this is a pretty simple, iterative procedure, and wouldn't require a model that has to train on data to be accomplished.

Can someone please explain how this imputer thing works and why it requires training to replace some missing values with the column mean? I have read scikit-learn's documentation, but it just shows how to use the methods, not why they're required.

Thank you.

Vinícius Silva asked Oct 11 '17 15:10




1 Answer

The Imputer fills missing values with a statistic of the data (e.g. mean, median, ...). To avoid data leakage during cross-validation, it computes that statistic on the training data during fit, stores it, and reuses it on the test data during transform.

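    import numpy as np
    from sklearn.preprocessing import Imputer  # scikit-learn >= 0.22: use sklearn.impute.SimpleImputer

    obj = Imputer(strategy='mean')

    # fit computes and stores the per-column mean of the training data
    obj.fit([[1, 2, 3], [2, 3, 4]])
    print(obj.statistics_)
    # array([ 1.5,  2.5,  3.5])

    # transform fills the NaNs using the stored means
    X = obj.transform([[4, np.nan, 6], [5, 6, np.nan]])
    print(X)
    # array([[ 4. ,  2.5,  6. ],
    #        [ 5. ,  6. ,  3.5]])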
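To make the split between the two steps concrete, here is a minimal pure-Python sketch of the same idea. The class name MeanImputerSketch is hypothetical and this is not scikit-learn's actual implementation; it only assumes that fit learns one mean per column and that transform reuses those stored means.

```python
import math


class MeanImputerSketch:
    """Hypothetical, simplified mean imputer (not scikit-learn's code)."""

    def fit(self, rows):
        # Learn one mean per column, skipping missing (NaN) entries.
        # Assumes every column has at least one observed value.
        n_cols = len(rows[0])
        self.statistics_ = []
        for j in range(n_cols):
            observed = [row[j] for row in rows if not math.isnan(row[j])]
            self.statistics_.append(sum(observed) / len(observed))
        return self

    def transform(self, rows):
        # Fill NaNs with the means stored by fit(); nothing is recomputed here.
        return [
            [self.statistics_[j] if math.isnan(v) else v for j, v in enumerate(row)]
            for row in rows
        ]


nan = math.nan
imputer = MeanImputerSketch().fit([[1, 2, 3], [2, 3, 4]])
filled = imputer.transform([[4, nan, 6], [5, 6, nan]])
print(imputer.statistics_)  # [1.5, 2.5, 3.5]
print(filled)               # [[4, 2.5, 6], [5, 6, 3.5]]
```

The "training" is nothing more than computing and remembering the column means. Keeping it in a separate step is what lets the same means be applied later to data the imputer has never seen.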

If you fit and transform the same data, you can do both steps in one call with fit_transform.

    X = obj.fit_transform([[1, 2, np.nan], [2, 3, 4]])
    print(X)
    # array([[ 1. ,  2. ,  4. ],
    #        [ 2. ,  3. ,  4. ]])

Avoiding this leakage matters because the data distribution may differ between the training data and the test data, and you don't want information from the test data to be present during the fit.
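As a rough illustration of the leakage point, the sketch below contrasts fitting on the training data with (incorrectly) fitting on the test data. It assumes a recent scikit-learn, where the class is sklearn.impute.SimpleImputer rather than the older Imputer. The arrays are made-up toy values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration): the test set is shifted upwards.
X_train = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
X_test = np.array([[np.nan, 100.0], [7.0, np.nan]])

# Correct: learn the column means on the training data only...
imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)
# ...then reuse those stored means on the test data.
X_test_filled = imputer.transform(X_test)

# Leaky: the means are computed from the test data itself.
X_test_leaky = SimpleImputer(strategy="mean").fit_transform(X_test)

print(imputer.statistics_)  # training means: 3.0 and 20.0
print(X_test_filled)        # NaNs replaced by 3.0 and 20.0
print(X_test_leaky)         # NaNs replaced by 7.0 and 100.0 (test means)
```

The two results differ because the test distribution differs from the training one. Only the first version mimics what happens when the model later meets unseen data.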

See the scikit-learn documentation for more information about cross-validation.
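One common way to get this behaviour automatically during cross-validation is to put the imputer in a pipeline, so that it is refit on the training part of every fold. The sketch below assumes a recent scikit-learn (SimpleImputer, make_pipeline, cross_val_score). The synthetic data and the choice of LinearRegression are illustrative assumptions only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic regression data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# Knock out roughly 10% of the entries to simulate missing values.
X[rng.random(X.shape) < 0.1] = np.nan

# Inside cross_val_score, the whole pipeline is fit on each training fold,
# so the imputer's means never see the held-out fold.
model = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
scores = cross_val_score(model, X, y, cv=5)
print(scores.shape)  # (5,)
```

Calling fit_transform on the full X before cross-validating would instead let every fold's held-out rows influence the imputed values.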

TomDLT answered Nov 04 '22 12:11