I am writing a very basic program to predict missing values in a dataset using scikit-learn's Imputer class.
I have made a NumPy array, created an Imputer object with strategy='mean' and performed fit_transform() on the NumPy array.
When I print the array after performing fit_transform(), the 'NaN's remain, and I don't get any prediction.
What am I doing wrong here? How do I go about predicting the missing values?
import numpy as np
from sklearn.preprocessing import Imputer
X = np.array([[23.56],[53.45],['NaN'],[44.44],[77.78],['NaN'],[234.44],[11.33],[79.87]])
print X
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit_transform(X)
print X
The scikit-learn library provides several mechanisms to deal with missing values: univariate feature imputation, multivariate feature imputation, and nearest-neighbors imputation.
The strategy parameter sets the imputation strategy. If “mean”, then replace missing values using the mean along the axis. If “median”, then replace missing values using the median along the axis. If “most_frequent”, then replace missing values using the most frequent value along the axis.
Imputation for completing missing values using k-Nearest Neighbors. Each sample's missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.
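As a rough illustration of the nearest-neighbors approach described above, here is a minimal sketch using KNNImputer from sklearn.impute. It assumes a scikit-learn version that ships KNNImputer (0.22 or later); the small array and the variable names X_knn, knn and X_filled are made up for the example.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: 4 samples, 3 features, two missing entries.
X_knn = np.array([[1.0, 2.0, np.nan],
                  [3.0, 4.0, 3.0],
                  [np.nan, 6.0, 5.0],
                  [8.0, 8.0, 7.0]])

# Each missing entry is replaced by the mean of that feature over the
# 2 nearest samples, where distance is computed only on the features
# that both samples have present.
knn = KNNImputer(n_neighbors=2)
X_filled = knn.fit_transform(X_knn)

print(X_filled)
```

Unlike the mean strategy, this needs more than one feature to be useful, since the neighbors are found from the columns that are not missing.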
We can use the SimpleImputer class from scikit-learn to replace missing values with a fill value. SimpleImputer has a parameter called strategy that offers four imputation methods; for example, strategy='mean' replaces missing values using the mean of the column.
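To make the four strategies concrete, the sketch below applies each one to the same single-column array. The data and the names data, imputers and results are invented for illustration, and the fill value of 0.0 for the 'constant' strategy is an arbitrary choice.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature data with one missing value at row 3.
data = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    # fill_value is arbitrary here; it is only used by strategy="constant".
    "constant": SimpleImputer(strategy="constant", fill_value=0.0),
}

results = {}
for name, imputer in imputers.items():
    # Record the value that each strategy writes into the missing slot.
    results[name] = imputer.fit_transform(data)[3, 0]
    print(name, results[name])
```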
Per the documentation, sklearn.preprocessing.Imputer.fit_transform returns a new array; it doesn't alter the argument array. The minimal fix is therefore:
X = imp.fit_transform(X)
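Note that Imputer was deprecated in scikit-learn 0.20 and removed in 0.22, so on a current install the import in the question fails. Below is a sketch of the same fix using SimpleImputer from sklearn.impute, assuming a recent scikit-learn and Python 3. It also uses np.nan instead of the string 'NaN' so that the array stays numeric; the name X_imputed is just for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# np.nan (a float) marks the missing entries, keeping the array dtype float.
X = np.array([[23.56], [53.45], [np.nan], [44.44], [77.78],
              [np.nan], [234.44], [11.33], [79.87]])

imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit_transform returns a new array; the input X is left unchanged,
# so the result has to be captured.
X_imputed = imp.fit_transform(X)

print(X_imputed)
```

Printing X afterwards would still show the two NaN entries, which is the behavior seen in the question; only the returned array holds the column mean in their place.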