Implementation of sklearn.impute.IterativeImputer

Consider the data below, which contains some NaN values:

    Column-1  Column-2  Column-3  Column-4  Column-5
0        NaN      15.0      63.0       8.0      40.0
1       60.0      51.0       NaN      54.0      31.0
2       15.0      17.0      55.0      80.0       NaN
3       54.0      43.0      70.0      16.0      73.0
4       94.0      31.0      94.0      29.0      53.0
5       99.0      52.0      77.0      91.0      58.0
6       84.0      19.0      36.0       NaN      97.0
7       41.0      91.0      62.0      67.0      68.0
8       44.0      38.0      27.0      53.0      37.0
9       58.0       NaN      63.0      57.0      28.0
10      66.0      68.0      89.0      36.0      47.0
11       7.0      81.0       5.0      99.0      16.0
12      43.0      55.0      64.0      88.0       NaN
13       8.0      90.0      91.0      44.0       4.0
14      29.0      52.0      94.0      71.0      47.0
15      22.0      21.0      68.0      61.0      38.0
16      76.0      36.0      70.0      99.0      50.0
17      38.0      31.0      66.0      79.0      99.0
18      94.0      22.0      92.0      39.0      58.0

I want to replace the NaN values in the data using sklearn.impute.IterativeImputer. A friend helped me with the code below:

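import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be enabled explicitly before it is imported
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imp = IterativeImputer(missing_values=np.nan, sample_posterior=False,
                       max_iter=10, tol=0.001,
                       n_nearest_features=4, initial_strategy='median')
imp.fit(data)
# Fill the missing entries and rebuild a DataFrame with the original column names
imputed_data = pd.DataFrame(data=imp.transform(data),
                            columns=['Column-1', 'Column-2', 'Column-3', 'Column-4', 'Column-5'],
                            dtype='int')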

The imputed_data is:


    Column-1  Column-2  Column-3  Column-4  Column-5
0         59        15        63         8        40
1         60        51        66        54        31
2         15        17        55        80        48
3         54        43        70        16        73
4         94        31        94        29        53
5         99        52        77        91        58
6         84        19        36        59        97
7         41        91        62        67        68
8         44        38        27        53        37
9         58        46        63        57        28
10        66        68        89        36        47
11         7        81         5        99        16
12        43        55        64        88        47
13         8        90        91        44         4
14        29        52        94        71        47
15        22        21        68        61        38
16        76        36        70        99        50
17        38        31        66        79        99
18        94        22        92        39        58

From the IterativeImputer documentation, the default estimator is BayesianRidge(). But if I use other estimators such as estimator=ExtraTreesRegressor(n_estimators=10, random_state=0) like in the code below, it returns a warning message. The code:

from sklearn.ensemble import ExtraTreesRegressor

imp = IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
                       missing_values=np.nan, sample_posterior=False,
                       max_iter=10, tol=0.001,
                       n_nearest_features=4, initial_strategy='median')
imp.fit(data)

The message:

C:\Users\...\sklearn\impute\_iterative.py:599: ConvergenceWarning: [IterativeImputer] Early stopping criterion not reached.
  " reached.", ConvergenceWarning)

My question: is this the correct approach, or should I do something to fix the warning message?
Thank you.

k.ko3n asked Jul 22 '19 21:07



2 Answers

They are having the same issue here:

https://github.com/scikit-learn/scikit-learn/issues/14338

mel1 answered Oct 17 '22 02:10


You are getting this warning because of the parameters max_iter=10 and tol=0.001 set for IterativeImputer().

The stopping criterion, max(abs(X_t - X_{t-1})) / max(abs(X[known_vals])) < tol, is not met within 10 iterations (max_iter=10).
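
To make the ratio concrete, here is a minimal NumPy sketch, assuming the criterion as worded in the scikit-learn documentation for max_iter (largest absolute change between two rounds, divided by the largest absolute known value). The helper name stopping_ratio and the toy arrays are hypothetical and only illustrate the arithmetic; this is not scikit-learn's internal code.

```python
import numpy as np

def stopping_ratio(X_prev, X_curr, X_known):
    """Hypothetical helper: max(abs(X_t - X_{t-1})) / max(abs(X[known_vals]))."""
    return np.max(np.abs(X_curr - X_prev)) / np.max(np.abs(X_known))

# Toy values: one imputed cell moved from 59.0 to 59.5 between two rounds.
X_prev = np.array([[59.0, 15.0], [60.0, 51.0]])
X_curr = np.array([[59.5, 15.0], [60.0, 51.0]])
# Values that were observed (never imputed) in the toy matrix.
X_known = np.array([15.0, 60.0, 51.0])

tol = 0.001
ratio = stopping_ratio(X_prev, X_curr, X_known)  # 0.5 / 60.0
converged = ratio < tol
print(ratio, converged)
```

In this toy case a change of 0.5 on data whose largest known value is 60 is still above tol=0.001, so another round would be needed.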

Refer to the description of max_iter in the parameters section of sklearn.impute.IterativeImputer documentation.

One workaround for this warning is to set the max_iter parameter to a higher value.
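
A rough sketch of that workaround, using the data from the question: fit once with max_iter=10 and once with a larger value, and record whether a ConvergenceWarning was raised each time. The helper fit_and_report, the value 50, and random_state=0 on the imputer are assumptions made for illustration. A larger max_iter is not guaranteed to silence the warning, so the sketch reports what happened instead of assuming convergence.

```python
import warnings

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.exceptions import ConvergenceWarning
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

nan = np.nan
# The 19 x 5 data set from the question.
data = np.array([
    [nan, 15, 63, 8, 40], [60, 51, nan, 54, 31], [15, 17, 55, 80, nan],
    [54, 43, 70, 16, 73], [94, 31, 94, 29, 53], [99, 52, 77, 91, 58],
    [84, 19, 36, nan, 97], [41, 91, 62, 67, 68], [44, 38, 27, 53, 37],
    [58, nan, 63, 57, 28], [66, 68, 89, 36, 47], [7, 81, 5, 99, 16],
    [43, 55, 64, 88, nan], [8, 90, 91, 44, 4], [29, 52, 94, 71, 47],
    [22, 21, 68, 61, 38], [76, 36, 70, 99, 50], [38, 31, 66, 79, 99],
    [94, 22, 92, 39, 58],
], dtype=float)

def fit_and_report(max_iter):
    """Hypothetical helper: impute and report (result, rounds run, warned?)."""
    imp = IterativeImputer(
        estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
        missing_values=np.nan, sample_posterior=False,
        max_iter=max_iter, tol=0.001,
        n_nearest_features=4, initial_strategy='median',
        random_state=0,  # assumption: fixed seed for reproducible feature selection
    )
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure the warning is not suppressed
        imputed = imp.fit_transform(data)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return imputed, imp.n_iter_, warned

for max_iter in (10, 50):
    _, n_iter, warned = fit_and_report(max_iter)
    print(f"max_iter={max_iter}: ran {n_iter} rounds, ConvergenceWarning={warned}")
```

If the warning persists at the larger setting, the imputed values are still returned; the warning only says that the change between the last two rounds was above tol.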

akhil penta answered Oct 17 '22 01:10