 

Is there a way to impute missing values in machine learning?

For my own learning, I've been trying out imputation methods other than mean/median/mode. So far I have tried KNN, MICE, and median imputation. I was told that imputation by clustering can also be done, but my internet search for a package that does it turned up only research papers.

I'm running these imputation methods on the Iris dataset after deliberately creating missing values in it (since Iris has none). My approach for the other methods is as follows:

import random

import numpy as np
import pandas as pd

data = pd.read_csv("D:/Iris_classification/train.csv")

#Shuffle the data and reset the index
from sklearn.utils import shuffle
data = shuffle(data).reset_index(drop = True)

#Create Independent and dependent matrices
X = data.iloc[:, [0, 1, 2, 3]].values
y = data.iloc[:, 4].values

#train_test_split (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 50, random_state = 0)

#Standardize the data
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()

X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

#Replace values with NaN at random
prop = int(X_train.size * 0.5) #Number of random draws; repeated draws can hit the same cell, so at most 50% is replaced
prop1 = int(X_test.size * 0.5)

a = [random.choice(range(X_train.shape[0])) for _ in range(prop)] #Randomly chosen row indices
b = [random.choice(range(X_train.shape[1])) for _ in range(prop)] #Randomly chosen column indices
c = [random.choice(range(X_test.shape[0])) for _ in range(prop1)] #Same for the test set
d = [random.choice(range(X_test.shape[1])) for _ in range(prop1)]

#Assumed: X1_train and X1_test are copies of the scaled arrays
X1_train, X1_test = X_train.copy(), X_test.copy()
X1_train[a, b] = np.nan
X1_test[c, d] = np.nan

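And then for KNN imputation, I've done:

from fancyimpute import KNN  #assumed source of KNN; newer fancyimpute releases use fit_transform() instead of complete()
X_train_filled = KNN(3).complete(X1_train)
X_test_filled = KNN(3).complete(X1_test)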

Is there a way to impute missing values with a clustering method? Also, StandardScaler() doesn't work when the data contains NaN values. Are there any other methods to standardize the data?

uharsha33 asked Apr 16 '18 10:04



1 Answer

The main problem to deal with here is missing data.

First of all, I need to tell you that removing the "problem" rows could be quite dangerous, because they can contain crucial information.

Is there a way to impute missing values by clustering?

Yes, you can replace each missing value with the mean of the observed values in its column.
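If clustering specifically is what you're after, one way to sketch it (an assumption on my part, not an existing scikit-learn imputer) is to fill the gaps with column means, run k-means, and then overwrite each gap with the matching coordinate of its row's centroid, repeating a few times. The function name kmeans_impute and its parameters are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans


def kmeans_impute(X, n_clusters=3, n_iter=10, random_state=0):
    """Hypothetical helper: fill NaNs with the coordinates of each row's k-means centroid."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Start from column means so that k-means receives a complete matrix.
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(missing, col_means, X)

    for _ in range(n_iter):
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
        labels = km.fit_predict(X_filled)
        # Overwrite only the originally missing cells with centroid values.
        X_filled[missing] = km.cluster_centers_[labels][missing]

    return X_filled
```

Observed values are never changed; only the cells that were NaN get updated on each pass. The number of clusters and iterations are guesses you would have to tune. For the plain column-mean replacement described above, none of this is needed.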

Column-mean imputation can be done with the Imputer class from the sklearn.preprocessing module.

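from sklearn.preprocessing import Imputer  # removed in scikit-learn 0.22; newer versions provide sklearn.impute.SimpleImputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)  # axis = 0: impute along columns
imputer = imputer.fit(X)
X = imputer.transform(X)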

You have to apply this right after the "Create Independent and dependent matrices" step, before scaling and the other steps.
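To make the ordering concrete, here is a minimal sketch that imputes first and scales second. It assumes a scikit-learn version that ships SimpleImputer (0.20 or later), and the tiny arrays are made up for illustration rather than taken from the Iris data:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data only: three training samples, two features, two gaps.
X_train = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, np.nan]])
X_test = np.array([[np.nan, 30.0]])

# Impute first, then scale; both steps learn their statistics from X_train only.
preprocess = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
X_train_ready = preprocess.fit_transform(X_train)
X_test_ready = preprocess.transform(X_test)
```

Fitting on the training set and only transforming the test set keeps test-set statistics out of the imputer and the scaler. Newer scikit-learn versions (0.20 onward, per the StandardScaler documentation) also skip NaNs when fitting and leave them in place when transforming, so scaling before imputing is possible there as well.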

Below is a simple example I created to show how it works:

Before: [image]

After: [image]
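As a stand-in for the before/after pictures, a small made-up matrix shows the same effect. The values are illustrative assumptions, not the data from the original example, and the sketch uses plain NumPy instead of the imputer class:

```python
import numpy as np

# Made-up matrix with one gap in each column.
before = np.array([[1.0, 2.0],
                   [np.nan, 4.0],
                   [5.0, np.nan]])

# Mean of the observed values in each column (NaNs are ignored).
col_means = np.nanmean(before, axis=0)

# Keep observed values, put the column mean where a value was missing.
after = np.where(np.isnan(before), col_means, before)

print(before)
print(after)
```

Both column means are 3.0 here, so the NaN in the first column becomes 3.0 and the NaN in the second column becomes 3.0, while every observed value stays as it was.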

Mihai Alexandru-Ionut answered Sep 18 '22 20:09
