Missing Values in Data Analysis

I have a dataset in which the variable GENDER, containing the two levels Male (M) and Female (F), has a lot of missing values. How do I deal with these missing values? What are the different methods for handling them? Any help would be appreciated.

asked Feb 28 '16 by Milan Amrut Joshi



2 Answers

There are several techniques for estimating a missing value; I've been writing a paper on such methods for a project at uni.
I will briefly explain five commonly used missing-data imputation techniques. Hereinafter we will consider a dataset in which every row is a pattern (or observation) and every column is a feature (or attribute), and let's say we want to "fix" a given pattern that has a missing value in its j-th feature (position).

  • Pattern removal.
    Remove a pattern from the dataset if it has at least one missing value.
    If there are lots of patterns with missing values, however, I would not suggest this approach, since the number of patterns in your dataset will drastically decrease and the training phase will not be adequate.
  • The mean/mode approach.
    If a pattern has a missing value in position j, take the mean (if the j-th attribute is continuous) or mode (if the j-th attribute is categorical) of the j-th column and substitute that mean/mode into the pattern's j-th position. Obviously, in the mean/mode evaluation you should consider only the non-missing values from column j (a short pandas sketch follows this list).
  • The conditional mean/mode.
    If you have labels (i.e. supervised learning), you can apply the previous approach but take into account, in the mean/mode evaluation, only the (non-missing) elements of column j belonging to patterns that have the very same label as the pattern you're trying to fix. This essentially refines the previous method because you do not consider values from patterns belonging to a different class.
  • Hot-decking.
    Given a certain dissimilarity metric, measure the dissimilarity between the pattern you want to fix and all the other patterns that are not missing a value in the attribute to be imputed (the j-th attribute in our case). Take the j-th feature from the most similar pattern and substitute it into the j-th position of the pattern you want to fix.
  • K-Nearest Neighbours.
    This is similar to hot-decking, but instead of considering only the most similar pattern, you consider the K most similar patterns that are not missing a value in the j-th feature. Then take the most frequent item (mode) amongst the j-th feature of these K patterns (a rough sketch appears further below).
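
A minimal sketch of the pattern-removal, mean/mode and conditional mean/mode approaches, assuming pandas and a toy DataFrame with hypothetical GENDER, AGE and CLASS columns (CLASS standing in for the label):

    import pandas as pd

    # Toy data: GENDER has missing values, CLASS plays the role of the label.
    df = pd.DataFrame({
        "GENDER": ["M", "F", None, "F", None, "M"],
        "AGE":    [23,  31,  27,   45,  39,   51],
        "CLASS":  ["a", "b", "a",  "b", "b",  "a"],
    })

    # Pattern removal: drop every row with at least one missing value.
    df_removed = df.dropna()

    # Mean/mode: fill GENDER with the mode of its non-missing values
    # (for a continuous column you would use .mean() instead of .mode()[0]).
    df_mode = df.assign(GENDER=df["GENDER"].fillna(df["GENDER"].mode()[0]))

    # Conditional mode: compute the mode of GENDER separately for each label.
    df_cond = df.copy()
    df_cond["GENDER"] = df.groupby("CLASS")["GENDER"].transform(
        lambda s: s.fillna(s.mode()[0]) if not s.mode().empty else s
    )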

The K value for K-Nearest Neighbours can be found by cross-validation, set a priori, or chosen with the rule-of-thumb value (K = square root of the number of instances).

The dissimilarity measure is up to you, but a common choice is the HEOM (Heterogeneous Euclidean-Overlap Metric), which can be found here (Section 2.3). Such a dissimilarity measure works well on datasets with lots of missing values, since it also lets you deal with patterns that have missing values elsewhere (obviously not in the feature you want to estimate).
It is indeed important to discard patterns that are missing a value in the feature to be imputed: if your dissimilarity measure returns a most similar pattern that is also missing a value in feature j, you would be substituting a missing value with another missing value, which is pointless. This applies to hot-decking, but the same concern extends to the K most similar patterns in K-Nearest Neighbours (i.e. the unlucky case in which the most frequent item amongst the j-th feature of the K most similar patterns is itself a missing value).
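
A rough sketch of the hot-deck / K-nearest-neighbours idea. The distance below is a simplified HEOM-style measure (overlap for categoricals, range-normalised difference for numerics, maximal distance when either value is missing), not a faithful reimplementation of the metric from the paper; the function names and the K = sqrt(n) default are assumptions:

    import numpy as np
    import pandas as pd
    from collections import Counter

    def simple_heom(a, b, numeric_cols, ranges):
        # Simplified HEOM-style dissimilarity between two rows: overlap distance
        # for categoricals, range-normalised difference for numerics, and
        # maximal distance (1) whenever either value is missing.
        d = 0.0
        for col in a.index:
            if pd.isna(a[col]) or pd.isna(b[col]):
                d += 1.0
            elif col in numeric_cols:
                d += (abs(a[col] - b[col]) / ranges[col]) ** 2
            else:
                d += 0.0 if a[col] == b[col] else 1.0
        return np.sqrt(d)

    def knn_impute(df, target_col, k=None):
        # Fill missing values of target_col with the mode of that column among
        # the K most similar donor rows (rows not missing target_col).
        numeric_cols = [c for c in df.columns
                        if c != target_col and pd.api.types.is_numeric_dtype(df[c])]
        ranges = {c: float(df[c].max() - df[c].min()) or 1.0 for c in numeric_cols}
        feats = df.drop(columns=[target_col])
        donors = df[df[target_col].notna()]
        k = k or max(1, int(np.sqrt(len(donors))))  # rule-of-thumb default
        out = df.copy()
        for i in df.index[df[target_col].isna()]:
            dists = np.array([simple_heom(feats.loc[i], feats.loc[j],
                                          numeric_cols, ranges)
                              for j in donors.index])
            nearest = donors.iloc[np.argsort(dists)[:k]]
            out.at[i, target_col] = Counter(nearest[target_col]).most_common(1)[0][0]
        return out

With k=1 this reduces to hot-decking, since only the single most similar donor is used.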

answered Sep 25 '22 by AlessioX


It depends a lot on the specific case. However, some general methods are:

  1. Removing the rows where some of the data is missing.

  2. Imputing missing values. Basically, you can treat the gender column as something you must predict (possibly using the other columns). Train your predictor on the rows that have all values, and predict the missing ones (see the sketch after this list).

  3. Creating a third category of "missing", and letting the machine learning algorithm deal with it.
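
A minimal sketch of options 2 and 3, assuming scikit-learn and a pandas DataFrame; the column names and the choice of RandomForestClassifier are illustrative assumptions, not a prescribed model:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical data: GENDER has missing values, the other columns are complete.
    df = pd.DataFrame({
        "GENDER": ["M", "F", None, "F", "M", None],
        "AGE":    [23,  31,  27,   45,  39,  51],
        "INCOME": [40,  52,  48,   61,  58,  75],
    })
    feature_cols = ["AGE", "INCOME"]

    # Option 3: keep the rows and treat "missing" as a category of its own.
    df_missing_cat = df.assign(GENDER=df["GENDER"].fillna("missing"))

    # Option 2: train a classifier on the rows where GENDER is known
    # and predict it for the rows where it is missing.
    known = df[df["GENDER"].notna()]
    unknown = df[df["GENDER"].isna()]
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(known[feature_cols], known["GENDER"])
    df.loc[unknown.index, "GENDER"] = clf.predict(unknown[feature_cols])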

answered Sep 23 '22 by Ami Tavory