Consider data, which contains some NaN values, shown below:
Column-1 Column-2 Column-3 Column-4 Column-5
0 NaN 15.0 63.0 8.0 40.0
1 60.0 51.0 NaN 54.0 31.0
2 15.0 17.0 55.0 80.0 NaN
3 54.0 43.0 70.0 16.0 73.0
4 94.0 31.0 94.0 29.0 53.0
5 99.0 52.0 77.0 91.0 58.0
6 84.0 19.0 36.0 NaN 97.0
7 41.0 91.0 62.0 67.0 68.0
8 44.0 38.0 27.0 53.0 37.0
9 58.0 NaN 63.0 57.0 28.0
10 66.0 68.0 89.0 36.0 47.0
11 7.0 81.0 5.0 99.0 16.0
12 43.0 55.0 64.0 88.0 NaN
13 8.0 90.0 91.0 44.0 4.0
14 29.0 52.0 94.0 71.0 47.0
15 22.0 21.0 68.0 61.0 38.0
16 76.0 36.0 70.0 99.0 50.0
17 38.0 31.0 66.0 79.0 99.0
18 94.0 22.0 92.0 39.0 58.0
I want to replace the NaN values in data using sklearn.impute.IterativeImputer. A friend helped me with the code below:
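import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, must come before the IterativeImputer import
from sklearn.impute import IterativeImputer

imp = IterativeImputer(missing_values=np.nan, sample_posterior=False,
                       max_iter=10, tol=0.001,
                       n_nearest_features=4, initial_strategy='median')
# Fit the imputer on data, then fill in the missing values
imp.fit(data)
imputed_data = pd.DataFrame(data=imp.transform(data),
                            columns=['Column-1', 'Column-2', 'Column-3', 'Column-4', 'Column-5'],
                            dtype='int')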
The resulting imputed_data is:
Column-1 Column-2 Column-3 Column-4 Column-5
0 59 15 63 8 40
1 60 51 66 54 31
2 15 17 55 80 48
3 54 43 70 16 73
4 94 31 94 29 53
5 99 52 77 91 58
6 84 19 36 59 97
7 41 91 62 67 68
8 44 38 27 53 37
9 58 46 63 57 28
10 66 68 89 36 47
11 7 81 5 99 16
12 43 55 64 88 47
13 8 90 91 44 4
14 29 52 94 71 47
15 22 21 68 61 38
16 76 36 70 99 50
17 38 31 66 79 99
18 94 22 92 39 58
According to the IterativeImputer documentation, the default estimator is BayesianRidge(). But if I use another estimator, such as estimator=ExtraTreesRegressor(n_estimators=10, random_state=0) in the code below, it returns a warning message.
The code:
from sklearn.ensemble import ExtraTreesRegressor

imp = IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
                       missing_values=np.nan, sample_posterior=False,
                       max_iter=10, tol=0.001,
                       n_nearest_features=4, initial_strategy='median')
imp.fit(data)
The message:
C:\Users\...\sklearn\impute\_iterative.py:599: ConvergenceWarning: [IterativeImputer] Early stopping criterion not reached. " reached.", ConvergenceWarning).
My question: is this a correct approach or should I do something to fix the warning message?
Thank you.
The imputation strategy. If “mean”, then replace missing values using the mean along the axis. If “median”, then replace missing values using the median along the axis. If “most_frequent”, then replace missing values using the most frequent value along the axis.
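The three strategies described above can be compared side by side. The following is a minimal sketch using sklearn.impute.SimpleImputer on a small array that is made up for illustration (it is not the data from the question):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data, invented for this example: two columns, one NaN in each.
X = np.array([[1.0, 10.0],
              [np.nan, 10.0],
              [3.0, np.nan],
              [8.0, 40.0]])

# Fill the NaNs column by column with each strategy.
filled = {}
for strategy in ("mean", "median", "most_frequent"):
    filled[strategy] = SimpleImputer(strategy=strategy).fit_transform(X)

# Column 0 has known values 1, 3, 8: mean 4.0, median 3.0.
# Column 1 has known values 10, 10, 40: mean 20.0, median 10.0, mode 10.0.
print(filled["mean"])
print(filled["median"])
print(filled["most_frequent"])
```

The median is less sensitive to outliers than the mean, which may be one reason to prefer initial_strategy='median' on skewed columns.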
IterativeImputer first fills the missing values using the strategy passed as initial_strategy, which defaults to the “mean” of each feature. The imputer then uses an estimator (BayesianRidge by default) at each step of the round-robin imputation.
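To make the round-robin idea concrete, here is a deliberately simplified sketch. It is not scikit-learn's actual implementation (it ignores tol, n_nearest_features, imputation order, and posterior sampling); the function name round_robin_impute and the synthetic data are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge


def round_robin_impute(X, n_rounds=5):
    """Simplified round-robin imputation (illustrative, not sklearn's code)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Step 1: initial fill with the mean of each column.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    # Step 2: repeatedly regress each incomplete column on the others
    # and overwrite only the originally missing entries.
    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            others = np.delete(filled, j, axis=1)
            model = BayesianRidge().fit(others[~rows], filled[~rows, j])
            filled[rows, j] = model.predict(others[rows])
    return filled


# Synthetic data: the third column is roughly a copy of the first.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=30)
X_missing = X.copy()
X_missing[[2, 7], 2] = np.nan
X_missing[5, 0] = np.nan

result = round_robin_impute(X_missing)
print(result[[2, 5, 7]])
```

Observed values are never modified; only the cells that were NaN in the input are re-estimated on each round.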
Maximum-Likelihood: In this method, all the null values are first removed from the data. Then the distribution of the column is identified, and the parameters of that distribution (for example, the mean and standard deviation) are estimated. Finally, the missing values are imputed by sampling points from that distribution.
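The steps above can be sketched as follows, under the assumption that the column is normally distributed (the text does not say which distribution to use). The function name normal_ml_impute is hypothetical; the input values are Column-2 from the question.

```python
import numpy as np


def normal_ml_impute(column, seed=0):
    """Impute NaNs by sampling from a normal fitted to the observed values.

    Assumes a normal distribution; other distributions need other estimators.
    """
    column = np.asarray(column, dtype=float)
    mask = np.isnan(column)

    # Step 1: drop the missing values.
    observed = column[~mask]

    # Step 2: maximum-likelihood estimates of the normal parameters
    # (np.std uses ddof=0 by default, which is the ML estimate).
    mu = observed.mean()
    sigma = observed.std()

    # Step 3: draw one sample per missing entry.
    rng = np.random.default_rng(seed)
    out = column.copy()
    out[mask] = rng.normal(mu, sigma, size=mask.sum())
    return out, mu, sigma


# Column-2 from the question, with its single NaN at index 9.
column_2 = np.array([15, 51, 17, 43, 31, 52, 19, 91, 38, np.nan,
                     68, 81, 55, 90, 52, 21, 36, 31, 22], dtype=float)
imputed, mu, sigma = normal_ml_impute(column_2)
print(imputed[9], mu, sigma)
```

Because the fill values are random draws, different seeds give different imputations, unlike the deterministic mean or median strategies.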
They are having the same issue here:
https://github.com/scikit-learn/scikit-learn/issues/14338
You are getting this warning because of the parameters max_iter=10 and tol=0.001 set for IterativeImputer().
The stopping criterion (max(abs(X_t - X_{t-1})) / max(abs(X[known_vals])) < tol) was not met within the 10 iterations allowed (max_iter=10).
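To see what the criterion measures, here is a small sketch that evaluates the documented formula by hand. The helper stopping_ratio and the tiny arrays are made up for illustration; scikit-learn computes this internally and does not expose such a function.

```python
import numpy as np


def stopping_ratio(X_prev, X_curr, X_original):
    """Largest change between two rounds, scaled by the largest known value."""
    known = ~np.isnan(X_original)
    return np.max(np.abs(X_curr - X_prev)) / np.max(np.abs(X_original[known]))


# Original data with one missing cell, and the imputed matrix
# after two consecutive rounds (values invented for the example).
X_original = np.array([[1.0, np.nan], [4.0, 8.0]])
X_prev = np.array([[1.0, 6.0], [4.0, 8.0]])
X_curr = np.array([[1.0, 6.004], [4.0, 8.0]])

tol = 0.001
ratio = stopping_ratio(X_prev, X_curr, X_original)  # about 0.004 / 8 = 0.0005
converged = ratio < tol
print(ratio, converged)
```

If the imputed values keep moving by more than tol times the largest known value from one round to the next, the loop runs until max_iter and the ConvergenceWarning is emitted.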
Refer to the description of max_iter in the parameters section of the sklearn.impute.IterativeImputer documentation.
One workaround for this warning is to set the max_iter parameter to a higher value.
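A minimal sketch of this workaround follows. It uses synthetic data and the default BayesianRidge estimator, both assumptions made to keep the example self-contained. Whether a larger max_iter actually silences the warning depends on the data and the estimator, so the code records the warning rather than assuming it goes away.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data (not the question's data) with about 10% of cells missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
X[:, 3] = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=40)
X[rng.random(X.shape) < 0.1] = np.nan


def fit_and_report(max_iter):
    """Impute X and report iterations used and whether the warning fired."""
    imp = IterativeImputer(max_iter=max_iter, tol=0.001, random_state=0)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        X_imputed = imp.fit_transform(X)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return X_imputed, imp.n_iter_, warned


for max_iter in (2, 50):
    X_imputed, n_iter, warned = fit_and_report(max_iter)
    print(f"max_iter={max_iter}: ran {n_iter} rounds, warning raised: {warned}")
```

If the warning persists even with a large max_iter, the imputed values are still returned and usable; the warning only says the values were still changing by more than tol when the loop stopped.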