
MATLAB: Handling missing data in categorical data

Tags: nan, matlab

I am trying to pass my dataset to MATLAB's relieff function, [ranked,weights] = relieff(X,Ylogical,10,'categoricalx','on'), to rank the importance of my predictor features. The dataset (an n-by-m double matrix) has n observations and m discrete (i.e. categorical) features. Every observation (row) in my dataset has at least one NaN value. These NaNs represent unobserved (i.e. missing or null) predictor values. (The dataset is not corrupted, just incomplete.)

relieff() uses the function below to remove every row that contains a NaN:

function [X,Y] = removeNaNs(X,Y)
% Remove observations with missing data
NaNidx = bsxfun(@or,isnan(Y),any(isnan(X),2));
X(NaNidx,:) = [];
Y(NaNidx,:) = [];

This is not ideal, especially in my case, because it leaves me with X=[] and Y=[] (i.e. no observations!).

In this case:

1) Would replacing all NaNs with an arbitrary sentinel value, e.g. 99999, help? Doing this introduces a new feature state for every predictor feature, so I guess it is not ideal.

2) Or is replacing NaNs with the mode of the corresponding feature column (as below) statistically more sound? (I am not vectorising, for clarity's sake.)

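function matrixdata = replaceNaNswithModes(matrixdata)
% Replace the NaNs in each column with that column's mode.
for i = 1:size(matrixdata,2)
    cv = matrixdata(:,i);
    modevalue = mode(cv);        % mode ignores NaN values
    cv(isnan(cv)) = modevalue;   % logical indexing; find is not needed
    matrixdata(:,i) = cv;
end
end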

3) Or is there any other approach that would make sense for "categorical" data?
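To make options 1) and 2) concrete, here is a minimal sketch on a toy matrix. The data and the names Xsentinel, Xmode, colModes and filler are invented for illustration, and 99999 is assumed not to collide with any real category code.

```matlab
% Toy data, invented for illustration: rows are observations,
% columns are categorical features coded as integers.
X = [1 NaN 2; 1 3 NaN; NaN 3 2; 2 3 2];
nanMask = isnan(X);

% Option 1: recode NaN as an extra category. The code 99999 is arbitrary
% and is assumed not to be used by any real category.
Xsentinel = X;
Xsentinel(nanMask) = 99999;

% Option 2: replace NaN with the column mode (mode ignores NaN values).
colModes = mode(X, 1);                      % 1-by-m row of column modes
filler = repmat(colModes, size(X, 1), 1);   % expand to the size of X
Xmode = X;
Xmode(nanMask) = filler(nanMask);
```

Note that if a column consists entirely of NaNs, mode returns NaN for it, so option 2 would leave that column unchanged.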

P.S: This link gives possible ways to handle missing data.

Zhubarb asked Dec 04 '25 16:12



1 Answer

I suggest using a table instead of a matrix. You then have functions such as ismissing (for the entire table) and isundefined (for categorical variables) to deal with missing values.

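T = array2table(matrix);
% standardizeMissing needs an indicator argument; -99 is a hypothetical
% sentinel. NaN is already the standard missing value for double, so this
% step only matters if the data uses some other code for "missing".
T = standardizeMissing(T, -99);
var1 = categorical(T{:,1});      % first column; NaN becomes <undefined>
missingRows = isundefined(var1);
T = T(~missingRows,:);           % keep only rows where column 1 is not missing
matrix = table2array(T);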
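As a rough illustration of ismissing applied to the entire table, the sketch below counts missing entries per column and flags the rows that contain any. The toy matrix X and the names missingMask, missingPerColumn and rowsWithMissing are assumptions made for the example.

```matlab
% Toy data, invented for illustration: every row has one NaN.
X = [1 NaN 2; NaN 3 2; 2 3 NaN];
T = array2table(X);

missingMask = ismissing(T);               % logical array, same size as T
missingPerColumn = sum(missingMask, 1);   % number of missing values per column
rowsWithMissing = any(missingMask, 2);    % true for rows with at least one missing value
```

Here every row is flagged, which is exactly the situation in which dropping incomplete rows leaves no observations.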
John Steed answered Dec 06 '25 09:12
