Missing Values in Data Analysis

I have a dataset in which the variable GENDER, containing the two levels Male (M) and Female (F), has a lot of missing values. How do I deal with these missing values? What are the different methods for handling them? Any help would be appreciated.

asked Feb 28 '16 by Milan Amrut Joshi



2 Answers

There are several techniques for estimating a missing value; I've been writing a paper on such methods for a project at uni.
I will briefly explain five commonly used missing-data imputation techniques. Hereinafter we will consider a dataset in which every row is a pattern (or observation) and every column is a feature (or attribute), and let's say we want to "fix" a given pattern that has a missing value in its j-th feature (position).

  • Pattern removal.
    Remove a pattern from the dataset if it has at least one missing value.
    If there are lots of patterns with missing values, however, I would not suggest this approach, since the number of patterns in your dataset will drastically decrease and the training phase will not be adequate.
  • The mean/mode approach.
    If a pattern has a missing value in position j, take the mean (if the j-th attribute is continuous) or mode (if the j-th attribute is categorical) of the j-th column and substitute that mean/mode into the pattern's j-th position. Obviously, in the mean/mode evaluation you should consider only the non-missing values from column j (a short pandas sketch follows this list).
  • The conditional mean/mode.
    If you have labels (i.e. supervised learning), you can apply the previous approach but take into account, in the mean/mode evaluation, only the (non-missing) elements of column j belonging to patterns that have the very same label as the pattern you're trying to fix. This essentially refines the previous method because you do not consider values from patterns belonging to a different class.
  • Hot-decking.
    Given a certain dissimilarity metric, measure the dissimilarity between the pattern you want to fix and all the other patterns that are not missing a value in the attribute to be imputed (the j-th attribute in our case). Take the j-th feature from the most similar pattern and substitute it into the j-th position of the pattern you want to fix.
  • K-Nearest Neighbours.
    This is similar to hot-decking, but instead of considering only the most similar pattern, you consider the K most similar patterns that are not missing a value in the j-th feature. Then take the most frequent item (mode) amongst the j-th feature of these K patterns (a rough sketch appears further below).
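
A minimal sketch of the pattern-removal, mean/mode and conditional mean/mode approaches, assuming pandas and a toy DataFrame with hypothetical GENDER, AGE and CLASS columns (CLASS standing in for the label):

    import pandas as pd

    # Toy data: GENDER has missing values, CLASS plays the role of the label.
    df = pd.DataFrame({
        "GENDER": ["M", "F", None, "F", None, "M"],
        "AGE":    [23,  31,  27,   45,  39,   51],
        "CLASS":  ["a", "b", "a",  "b", "b",  "a"],
    })

    # Pattern removal: drop every row with at least one missing value.
    df_removed = df.dropna()

    # Mean/mode: fill GENDER with the mode of its non-missing values
    # (for a continuous column you would use .mean() instead of .mode()[0]).
    df_mode = df.assign(GENDER=df["GENDER"].fillna(df["GENDER"].mode()[0]))

    # Conditional mode: compute the mode of GENDER separately for each label.
    df_cond = df.copy()
    df_cond["GENDER"] = df.groupby("CLASS")["GENDER"].transform(
        lambda s: s.fillna(s.mode()[0]) if not s.mode().empty else s
    )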

The K value for K-Nearest Neighbours can be found by cross-validation, set a priori, or chosen with the rule-of-thumb value (K = square root of the number of instances).

The dissimilarity measure is up to you, but a common choice is the HEOM (Heterogeneous Euclidean-Overlap Metric), which can be found here (Section 2.3). Such a dissimilarity measure works well on datasets with lots of missing values, since it also lets you deal with patterns that have missing values elsewhere (obviously not in the feature you want to estimate).
It is indeed important to discard patterns that are missing a value in the feature to be imputed: if your dissimilarity measure returns a most similar pattern that is also missing a value in feature j, you would be substituting a missing value with another missing value, which is pointless. This applies to hot-decking, but the same concern extends to the K most similar patterns in K-Nearest Neighbours (i.e. the unlucky case in which the most frequent item amongst the j-th feature of the K most similar patterns is itself a missing value).
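
A rough sketch of the hot-deck / K-nearest-neighbours idea. The distance below is a simplified HEOM-style measure (overlap for categoricals, range-normalised difference for numerics, maximal distance when either value is missing), not a faithful reimplementation of the metric from the paper; the function names and the K = sqrt(n) default are assumptions:

    import numpy as np
    import pandas as pd
    from collections import Counter

    def simple_heom(a, b, numeric_cols, ranges):
        # Simplified HEOM-style dissimilarity between two rows: overlap distance
        # for categoricals, range-normalised difference for numerics, and
        # maximal distance (1) whenever either value is missing.
        d = 0.0
        for col in a.index:
            if pd.isna(a[col]) or pd.isna(b[col]):
                d += 1.0
            elif col in numeric_cols:
                d += (abs(a[col] - b[col]) / ranges[col]) ** 2
            else:
                d += 0.0 if a[col] == b[col] else 1.0
        return np.sqrt(d)

    def knn_impute(df, target_col, k=None):
        # Fill missing values of target_col with the mode of that column among
        # the K most similar donor rows (rows not missing target_col).
        numeric_cols = [c for c in df.columns
                        if c != target_col and pd.api.types.is_numeric_dtype(df[c])]
        ranges = {c: float(df[c].max() - df[c].min()) or 1.0 for c in numeric_cols}
        feats = df.drop(columns=[target_col])
        donors = df[df[target_col].notna()]
        k = k or max(1, int(np.sqrt(len(donors))))  # rule-of-thumb default
        out = df.copy()
        for i in df.index[df[target_col].isna()]:
            dists = np.array([simple_heom(feats.loc[i], feats.loc[j],
                                          numeric_cols, ranges)
                              for j in donors.index])
            nearest = donors.iloc[np.argsort(dists)[:k]]
            out.at[i, target_col] = Counter(nearest[target_col]).most_common(1)[0][0]
        return out

With k=1 this reduces to hot-decking, since only the single most similar donor is used.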

answered Sep 25 '22 by AlessioX


It depends a lot on the specific case. However, some general methods are:

  1. Removing the rows where some of the data is missing.

  2. Imputing missing values. Basically, you can treat the gender column as something you must predict (possibly using the other columns). Train your predictor on the rows that have all values, and predict the missing ones (see the sketch after this list).

  3. Creating a third category of "missing", and letting the machine learning algorithm deal with it.
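
A minimal sketch of options 2 and 3, assuming scikit-learn and a pandas DataFrame; the column names and the choice of RandomForestClassifier are illustrative assumptions, not a prescribed model:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical data: GENDER has missing values, the other columns are complete.
    df = pd.DataFrame({
        "GENDER": ["M", "F", None, "F", "M", None],
        "AGE":    [23,  31,  27,   45,  39,  51],
        "INCOME": [40,  52,  48,   61,  58,  75],
    })
    feature_cols = ["AGE", "INCOME"]

    # Option 3: keep the rows and treat "missing" as a category of its own.
    df_missing_cat = df.assign(GENDER=df["GENDER"].fillna("missing"))

    # Option 2: train a classifier on the rows where GENDER is known
    # and predict it for the rows where it is missing.
    known = df[df["GENDER"].notna()]
    unknown = df[df["GENDER"].isna()]
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(known[feature_cols], known["GENDER"])
    df.loc[unknown.index, "GENDER"] = clf.predict(unknown[feature_cols])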

answered Sep 23 '22 by Ami Tavory