For my own learning, I've been trying out imputation methods other than mean/median/mode. So far I have tried KNN, MICE, and median imputation. I was told that imputation by clustering is also possible, but my internet search for a package that does it turned up only research papers.
I'm running these imputation methods on the Iris dataset after deliberately creating missing values in it (since Iris has no missing values). My approach for the other methods is as follows:
import random
import numpy as np
import pandas as pd
data = pd.read_csv("D:/Iris_classification/train.csv")
#Shuffle the data and reset the index
from sklearn.utils import shuffle
data = shuffle(data).reset_index(drop = True)
#Create Independent and dependent matrices
X = data.iloc[:, [0, 1, 2, 3]].values
y = data.iloc[:, 4].values
#train_test_split
from sklearn.model_selection import train_test_split #replaces the removed sklearn.cross_validation module
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 50, random_state = 0)
#Standardize the data
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
#Introduce missing values at random
prop = int(X_train.size * 0.5) #Set the % of values to be replaced
prop1 = int(X_test.size * 0.5)
a = [random.choice(range(X_train.shape[0])) for _ in range(prop)] #Randomly choose indices of the numpy array
b = [random.choice(range(X_train.shape[1])) for _ in range(prop)]
c = [random.choice(range(X_test.shape[0])) for _ in range(prop1)]
d = [random.choice(range(X_test.shape[1])) for _ in range(prop1)]
#Work on copies so the complete arrays stay available for comparison
X1_train, X1_test = X_train.copy(), X_test.copy()
#Indices are drawn with replacement, so somewhat fewer than 50% of the cells end up missing
X1_train[a, b] = np.nan
X1_test[c, d] = np.nan
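And then for KNN imputation, I've done
#KNN is assumed to be fancyimpute's imputer; newer releases may expose fit_transform instead of complete
from fancyimpute import KNN
X_train_filled = KNN(3).complete(X1_train)
X_test_filled = KNN(3).complete(X1_test)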
Is there a way to impute missing values by clustering? Also, StandardScaler() doesn't work when the data contains NaN values. Are there any other methods to standardize the data?
The main problem we have to deal with here is missing data.
First of all, I need to tell you that removing the "problem" rows can be quite dangerous, because they may contain crucial information.
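To give a feel for how much data dropping rows can cost, here is a minimal sketch on synthetic data. The array shape and the 50% missing rate are assumptions chosen to resemble the setup in the question, not the actual Iris file.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for the four Iris features: 100 rows, 4 columns.
X = rng.normal(size=(100, 4))

# Blank out exactly 50% of the cells, chosen without replacement.
n_missing = X.size // 2
flat_idx = rng.choice(X.size, size=n_missing, replace=False)
X_missing = X.copy()
X_missing.flat[flat_idx] = np.nan

# A row survives listwise deletion only if none of its cells is missing.
complete_rows = ~np.isnan(X_missing).any(axis=1)
print(complete_rows.sum(), "of", len(X_missing), "rows have no missing value")
```

With half of the cells missing, only a small fraction of the rows is fully observed, so deleting incomplete rows throws away most of the dataset.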
Is there a way to impute missing values by clustering?
Yes, you can replace the missing data with the mean of all the values in the column.
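You can do this using the SimpleImputer class from the sklearn.impute module (in scikit-learn versions before 0.22 this was the Imputer class in sklearn.preprocessing).
import numpy as np
from sklearn.impute import SimpleImputer
#Replace every NaN with the mean of its column
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X)
X = imputer.transform(X)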
You have to use this method right after "Create Independent and dependent matrices", before scaling and the other steps.
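One way to keep that order fixed is to chain the two steps in a scikit-learn pipeline. This is only a sketch: the small arrays below are made-up values for illustration, and the name prep is hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up feature rows with a few missing cells (illustration only).
X_train = np.array([[5.1, 3.5, np.nan, 0.2],
                    [4.9, np.nan, 1.4, 0.2],
                    [6.2, 2.9, 4.3, 1.3],
                    [np.nan, 3.0, 5.1, 1.8]])
X_test = np.array([[5.0, np.nan, 1.5, 0.2],
                   [6.0, 3.0, np.nan, 1.5]])

# Impute first, then scale; both steps are fitted on the training data only.
prep = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
X_train_prepared = prep.fit_transform(X_train)

# The test set reuses the training column means and scaling parameters.
X_test_prepared = prep.transform(X_test)

print(X_train_prepared)
print(X_test_prepared)
```

Because the scaler only ever sees imputed data, it no longer has to cope with NaN values.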
I created a simple example below to show you how it works:
[Images: the data before and after imputation]
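For imputation by clustering specifically, scikit-learn has no dedicated class as far as I know, so the following is only a rough sketch of one possible approach: fill the gaps with column means, run K-means, then replace each missing cell with the matching coordinate of its row's cluster centroid, and repeat. The function name kmeans_impute, its parameters, and the synthetic blobs are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_impute(X, n_clusters=3, n_iter=10, random_state=0):
    """Fill NaNs with the matching coordinate of each row's cluster centroid."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means so that K-means gets complete rows.
    X_filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
        labels = km.fit_predict(X_filled)
        # Observed cells are kept; missing cells take the centroid's value.
        X_new = np.where(missing, km.cluster_centers_[labels], X)
        if np.allclose(X_new, X_filled):
            break
        X_filled = X_new
    return X_filled

# Synthetic data: three well-separated blobs with four features each.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 4, [5.0] * 4, [10.0] * 4])
X_true = np.vstack([c + rng.normal(scale=0.5, size=(30, 4)) for c in centers])

# Hide roughly 20% of the cells.
mask = rng.random(X_true.shape) < 0.2
X_before = X_true.copy()
X_before[mask] = np.nan

X_after = kmeans_impute(X_before, n_clusters=3)
print("Before:\n", X_before[:5])
print("After:\n", X_after[:5])
```

The number of clusters has to be chosen by hand here, and the result depends on how well separated the clusters really are, so treat this as a starting point and not as a tested method.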