
How does the Multivariate imputer in scikit-learn differ from the Simple imputer?

I have a matrix of data with missing values that I am trying to impute, and I am looking at the options for the different imputers to see which settings would work best for the biological context I am working in. I understand the knnimpute function in MATLAB and the SimpleImputer in scikit-learn. However, I'm not quite sure my understanding of the IterativeImputer is correct.

I have looked at the documentation at this site for the multivariate/iterative imputer -- https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html

I don't understand the explanation of the algorithm as "round-robin". Does the imputer use the characteristics of both the columns and the rows of the matrix to determine the "value" of a missing data point? And does it then take that approach one missing data point at a time, to avoid shifting the data unnaturally towards the characteristics of a previously imputed data point?

asked Jan 26 '23 by Kangaroo
1 Answer

My understanding of the algorithms is as follows:

Simple Imputer

The SimpleImputer uses the non-missing values in each column to estimate the missing values in that column.

For example, if you had an age column with 10% missing values, it would find the mean age and replace every missing value in the age column with that value.

It supports several other imputation strategies, such as the median and the mode (most_frequent), as well as a constant value you define yourself. These last two can also be used on categorical values.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'A': [np.nan, 2, 3],
                   'B': [3, 5, np.nan],
                   'C': [0, np.nan, 3],
                   'D': [2, 6, 3]})
print(df)

   A    B    C    D
0  NaN  3.0  0.0  2
1  2.0  5.0  NaN  6
2  3.0  NaN  3.0  3

imp = SimpleImputer()
imp.fit_transform(df)

array([[2.5, 3. , 0. , 2. ],
       [2. , 5. , 1.5, 6. ],
       [3. , 4. , 3. , 3. ]])

As you can see, the imputed values are simply the mean value of each column.
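To make the other strategies concrete, here is a small sketch on the same frame using the median and a constant fill value (the `-1` sentinel is just an arbitrary choice for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'A': [np.nan, 2, 3],
                   'B': [3, 5, np.nan],
                   'C': [0, np.nan, 3],
                   'D': [2, 6, 3]})

# strategy='median' replaces each NaN with its column's median
median_imp = SimpleImputer(strategy='median')
print(median_imp.fit_transform(df))

# strategy='constant' replaces every NaN with one value you choose
const_imp = SimpleImputer(strategy='constant', fill_value=-1)
print(const_imp.fit_transform(df))
```

On this frame the medians happen to equal the means (each column has only two non-missing values), so the first output matches the mean-imputed array above.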

Iterative Imputer

The Iterative Imputer can do a number of different things depending upon how you configure it. This explanation assumes the default values.

Original Data
   A    B    C    D
0  NaN  3.0  0.0  2
1  2.0  5.0  NaN  6
2  3.0  NaN  3.0  3

First, it does the same thing as the SimpleImputer, i.e. it imputes the missing values based upon the initial_strategy parameter (default = 'mean').

Initial Pass
   A    B    C    D
0  2.5  3.0  0.0  2
1  2.0  5.0  1.5  6
2  3.0  4.0  3.0  3

Second, it trains the estimator passed in (default = BayesianRidge) as a predictor. In our case we have columns A, B, C, D, so the regressor would fit a model with independent variables A, B, C and dependent variable D:

X = df[['A', 'B', 'C']]
y = df['D']
model = BayesianRidge().fit(X, y)

Then it calls the predict method of this newly fitted model on the entries that were originally missing in D and overwrites them with the predictions (in pseudocode):

missing_D = mask of the originally-missing entries in column D
df.loc[missing_D, 'D'] = model.predict(df.loc[missing_D, ['A', 'B', 'C']])
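The same step can be sketched by hand for column A. This is a hypothetical re-implementation of one round-robin step, not the library's internal code, but it shows the fit-on-observed / predict-on-missing mechanic:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import BayesianRidge

df = pd.DataFrame({'A': [np.nan, 2, 3],
                   'B': [3, 5, np.nan],
                   'C': [0, np.nan, 3],
                   'D': [2, 6, 3]})

missing_A = df['A'].isna()            # remember which rows were missing
filled = df.fillna(df.mean())         # the "initial pass": mean-impute everything

# fit the estimator only on rows where A was actually observed
model = BayesianRidge()
model.fit(filled.loc[~missing_A, ['B', 'C', 'D']],
          filled.loc[~missing_A, 'A'])

# overwrite only the originally-missing entries of A with predictions
filled.loc[missing_A, 'A'] = model.predict(filled.loc[missing_A, ['B', 'C', 'D']])
print(filled)
```

Note that the observed values of A (rows 1 and 2) are never touched; only the flagged entries get replaced.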

This method is repeated for all combinations of columns (the round-robin described in the docs), e.g.

X = df[['B','C','D']]
y = df[['A']]
...

X = df[['A','C','D']]
y = df[['B']]
...   

X = df[['A','B','D']]
y = df[['C']]    
...

This round-robin of training an estimator on each combination of columns makes up one pass. The process is repeated until either the stopping tolerance (tol) is met or the imputer reaches the maximum number of iterations (max_iter, default = 10).

So if we run it for three passes it looks like this:

Original Data
   A    B    C    D
0  NaN  3.0  0.0  2
1  2.0  5.0  NaN  6
2  3.0  NaN  3.0  3

Initial (simple) Pass
   A    B    C    D
0  2.5  3.0  0.0  2
1  2.0  5.0  1.5  6
2  3.0  4.0  3.0  3


pass_1
[[3.55243135 3.         0.         2.        ]
 [2.         5.         7.66666393 6.        ]
 [3.         3.7130697  3.         3.        ]]

pass_2
[[ 3.39559017  3.          0.          2.        ]
 [ 2.          5.         10.39409964  6.        ]
 [ 3.          3.57003864  3.          3.        ]]

pass_3
[[ 3.34980014  3.          0.          2.        ]
 [ 2.          5.         11.5269743   6.        ]
 [ 3.          3.51894112  3.          3.        ]]

Obviously it doesn't work well on such a small example, because there isn't enough data to fit the estimator on, so with a smaller data-set it may be best to stick with the SimpleImputer.
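With enough rows and correlated columns the picture flips. Here is a sketch on synthetic data (all the column names and sizes are arbitrary choices for illustration) where the columns share a common factor, so the iterative imputer can exploit between-column relationships that the SimpleImputer ignores:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.RandomState(0)
# four strongly correlated columns: one shared factor plus small noise
X = rng.normal(size=(200, 1)) @ np.ones((1, 4)) + 0.1 * rng.normal(size=(200, 4))
X_missing = X.copy()
X_missing[rng.rand(200, 4) < 0.1] = np.nan   # knock out ~10% of the entries

mask = np.isnan(X_missing)
simple_err = np.abs(SimpleImputer().fit_transform(X_missing)[mask] - X[mask]).mean()
iter_err = np.abs(IterativeImputer(random_state=0).fit_transform(X_missing)[mask] - X[mask]).mean()
print(simple_err, iter_err)   # the iterative error should be much smaller here
```

The mean-impute error is on the scale of the column spread, while the regression-based imputation error is on the scale of the noise, which is exactly the situation where the iterative approach pays off.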

answered Jan 28 '23 by counterpig