I have a data set in which the variable GENDER, which has two levels, Male (M) and Female (F), has a lot of missing values. How do I deal with these missing values? What are the different methods for handling them? Any help would be appreciated.
Missing values are common when working with real-world datasets – not the cleaned ones available on Kaggle, for example. Missing data could result from a human factor (for example, a person deliberately failing to respond to a survey question), a problem in electrical sensors, or other factors.
How to find the missing values: to find which values are missing from a list, put the value to check for and the list to be checked inside a COUNTIF statement. If the value is found in the list, COUNTIF returns the number of times it occurs there; a count of 0 means the value is missing.
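For readers working outside a spreadsheet, the same counting idea can be sketched in Python. This is only an illustration: `find_missing`, `expected_ids` and `observed_ids` are hypothetical names, and `collections.Counter` plays the role of COUNTIF.

```python
from collections import Counter


def find_missing(expected, observed):
    """Return the expected values that never occur in `observed`."""
    counts = Counter(observed)  # occurrence count per value, like COUNTIF
    # A Counter returns 0 for values it has never seen.
    return [value for value in expected if counts[value] == 0]


# Hypothetical example data
expected_ids = [101, 102, 103, 104, 105]
observed_ids = [101, 103, 103, 105]

missing_ids = find_missing(expected_ids, observed_ids)
print(missing_ids)
```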
Imputation means replacing a missing value with another value based on a reasonable estimate. You use other data to recreate the missing value for a more complete dataset.
There are several techniques for estimating a missing value. I've been writing a paper on such methods for a project at Uni.
I will briefly explain 5 commonly used missing data imputation techniques. From here on, we consider a dataset in which every row is a pattern (or observation) and every column is a feature (or attribute). Let's say we want to "fix" a given pattern that has a missing value in its j-th feature (position).
The K value for K-Nearest Neighbours can be found by cross-validation, set a priori, or chosen with the rule of thumb (K = square root of the number of instances).
The dissimilarity measure is actually up to you, but a common choice is the HEOM (Heterogeneous Euclidean Overlap Metric), which can be found here (Section 2.3). This measure works well on datasets with many missing values, since it can also handle patterns that have missing values themselves (obviously not in the feature you want to estimate).
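As a rough sketch of how HEOM is usually defined: the per-feature distance is 1 if either value is missing, the 0/1 overlap for nominal features, and the range-normalised absolute difference for numeric ones; the total is the square root of the sum of squared per-feature distances. The function name `heom`, the `numeric_ranges` argument and the use of `None` for missing values are assumptions made for this example.

```python
import math

MISSING = None  # assumption: missing values are encoded as None


def heom(x, y, numeric_ranges):
    """HEOM distance between two patterns.

    `numeric_ranges` maps the index of each numeric feature to its
    range (max - min) in the dataset; all other features are nominal.
    """
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if xa is MISSING or ya is MISSING:
            d = 1.0  # maximal distance when a value is unknown
        elif a in numeric_ranges:
            rng = numeric_ranges[a]
            d = abs(xa - ya) / rng if rng else 0.0  # range-normalised
        else:
            d = 0.0 if xa == ya else 1.0  # overlap for nominal features
        total += d * d
    return math.sqrt(total)


# Hypothetical patterns: (gender, age, colour); age has range 20.
p = ("M", 30, "red")
q = ("M", 40, None)
distance = heom(p, q, {1: 20})
```

Here the three per-feature distances are 0, 0.5 and 1, so `distance` is the square root of 1.25.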
It is indeed important to discard patterns that are missing a value in the feature to be imputed: if your dissimilarity measure returns a most similar pattern that is also missing a value in feature j, you are basically substituting a missing value with another missing value. Pointless. This example is for hot-decking, but the same concept extends to the K most similar patterns in K-nearest neighbours (i.e. the unlucky case in which the most frequent item in the j-th feature among the K most similar patterns is itself a missing value).
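A minimal sketch of this filtering step for a categorical feature such as gender. To keep the example short it uses a plain mismatch count rather than HEOM as the dissimilarity, and the names `overlap_distance` and `knn_impute`, the `None` encoding and the sample data are all assumptions. Setting `k=1` gives hot-decking; leaving `k` unset uses the square-root rule of thumb.

```python
import math
from collections import Counter

MISSING = None  # assumption: missing values are encoded as None


def overlap_distance(x, y, skip):
    """Count mismatching features, ignoring feature `skip`.

    A missing value on either side counts as a mismatch.
    """
    return sum(
        1
        for a, (xa, ya) in enumerate(zip(x, y))
        if a != skip and (xa is MISSING or ya is MISSING or xa != ya)
    )


def knn_impute(data, i, j, k=None):
    """Estimate feature j of pattern data[i] from its nearest donors."""
    target = data[i]
    # Discard the target itself and every pattern missing feature j.
    donors = [
        row for idx, row in enumerate(data)
        if idx != i and row[j] is not MISSING
    ]
    if not donors:
        return MISSING  # nothing to impute from
    if k is None:
        k = max(1, round(math.sqrt(len(data))))  # rule-of-thumb K
    donors.sort(key=lambda row: overlap_distance(target, row, j))
    neighbours = donors[:k]
    # Most frequent value of feature j among the K nearest donors.
    return Counter(row[j] for row in neighbours).most_common(1)[0][0]


# Hypothetical data: (gender, job, smoker)
data = [
    ("F", "nurse", "no"),
    ("F", "nurse", "yes"),
    ("M", "mechanic", "no"),
    (None, "nurse", "no"),   # pattern to fix
    (None, "nurse", "no"),   # also missing gender, so never a donor
    ("M", "mechanic", "yes"),
]

knn_value = knn_impute(data, 3, 0)            # K = round(sqrt(6)) = 2
hot_deck_value = knn_impute(data, 3, 0, k=1)  # single most similar donor
```

Because donors are filtered before the neighbours are ranked, the imputed value can only be missing when no pattern at all has feature j filled in.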
It depends a lot on the specific case. However, some general methods are:
Removing the rows where some of the data is missing.
Imputing missing values. Basically, you can consider the gender column as something you must predict (using, possibly, the other columns). Train your predictor using the rows that have all values, and predict the missing ones.
Creating a third category of "missing", and letting the machine learning algorithm deal with it.
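The three options above can be sketched on a toy table. The rows and field names are made up, and for option 2 the "predictor" is only a stand-in that always returns the most frequent observed gender; in practice you would train a real classifier on the other columns.

```python
from collections import Counter

# Hypothetical rows; None marks a missing gender.
rows = [
    {"gender": "F", "age": 34},
    {"gender": None, "age": 51},
    {"gender": "M", "age": 27},
    {"gender": "M", "age": 45},
    {"gender": None, "age": 38},
]

# 1. Remove the rows where gender is missing.
complete = [r for r in rows if r["gender"] is not None]

# 2. Impute: fit on the complete rows, fill in the incomplete ones.
#    Stand-in predictor (assumption): the most frequent observed value.
most_frequent = Counter(r["gender"] for r in complete).most_common(1)[0][0]
imputed = [
    {**r, "gender": r["gender"] if r["gender"] is not None else most_frequent}
    for r in rows
]

# 3. Treat "missing" as a third category of its own.
flagged = [
    {**r, "gender": r["gender"] if r["gender"] is not None else "Missing"}
    for r in rows
]

print(len(complete))
print([r["gender"] for r in imputed])
print([r["gender"] for r in flagged])
```

Option 1 shrinks the dataset, option 2 keeps every row at the cost of some guessed values, and option 3 leaves it to the learning algorithm to decide whether missingness itself is informative.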