How to handle missing values (NaNs) for machine learning in Python

Tags: python, pandas, missing-data, machine-learning

How do I handle missing values in a dataset before applying a machine learning algorithm?

I noticed that simply dropping rows with missing (NaN) values is not a smart thing to do. I usually impute them (fill each gap with the column mean) using pandas, which kind of works and improves classification accuracy, but it may not be the best approach.
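The mean-fill approach described above might look roughly like the sketch below. The toy values and the `Image` strings are made up for illustration; only the column names come from the dataset shown further down.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real data has 7049 rows and 31 columns.
df = pd.DataFrame({
    "left_eye_center_x": [66.0, 64.0, np.nan, 65.0],
    "nose_tip_x": [44.0, np.nan, 48.0, 52.0],
    "Image": ["238 236 237", "219 215 204", "144 142 159", "193 192 193"],
})

# Column means skip NaN by default; numeric_only=True leaves the
# non-numeric Image column out of the calculation.
column_means = df.mean(numeric_only=True)

# Each NaN is replaced by the mean of its own column.
filled = df.fillna(column_means)
```

Every gap in a column receives the same value here, which is what makes the method simple and also what limits it.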

So here is the important question: what is the best way to handle missing values in a data set?

For example, in the dataset below, most columns have original (non-null) data for only about 30% of the rows.

Int64Index: 7049 entries, 0 to 7048
Data columns (total 31 columns):
left_eye_center_x            7039 non-null float64
left_eye_center_y            7039 non-null float64
right_eye_center_x           7036 non-null float64
right_eye_center_y           7036 non-null float64
left_eye_inner_corner_x      2271 non-null float64
left_eye_inner_corner_y      2271 non-null float64
left_eye_outer_corner_x      2267 non-null float64
left_eye_outer_corner_y      2267 non-null float64
right_eye_inner_corner_x     2268 non-null float64
right_eye_inner_corner_y     2268 non-null float64
right_eye_outer_corner_x     2268 non-null float64
right_eye_outer_corner_y     2268 non-null float64
left_eyebrow_inner_end_x     2270 non-null float64
left_eyebrow_inner_end_y     2270 non-null float64
left_eyebrow_outer_end_x     2225 non-null float64
left_eyebrow_outer_end_y     2225 non-null float64
right_eyebrow_inner_end_x    2270 non-null float64
right_eyebrow_inner_end_y    2270 non-null float64
right_eyebrow_outer_end_x    2236 non-null float64
right_eyebrow_outer_end_y    2236 non-null float64
nose_tip_x                   7049 non-null float64
nose_tip_y                   7049 non-null float64
mouth_left_corner_x          2269 non-null float64
mouth_left_corner_y          2269 non-null float64
mouth_right_corner_x         2270 non-null float64
mouth_right_corner_y         2270 non-null float64
mouth_center_top_lip_x       2275 non-null float64
mouth_center_top_lip_y       2275 non-null float64
mouth_center_bottom_lip_x    7016 non-null float64
mouth_center_bottom_lip_y    7016 non-null float64
Image                        7049 non-null object
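As a quick check of the "about 30%" figure, the share of observed rows can be computed from the non-null counts in the listing above. This is only a sketch over a handful of the listed columns.

```python
# Non-null counts copied from the listing above (a few columns only).
total_rows = 7049
non_null = {
    "left_eye_center_x": 7039,
    "left_eye_inner_corner_x": 2271,
    "left_eyebrow_outer_end_x": 2225,
    "nose_tip_x": 7049,
}

# Fraction of rows that actually hold a value, per column.
observed_share = {col: n / total_rows for col, n in non_null.items()}

for col, share in observed_share.items():
    print(f"{col}: {share:.1%} observed")

# With the DataFrame itself loaded as df, df.notna().mean() would give
# the same per-column fractions directly.
```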

asked Jan 07 '15 17:01

pbu

1 Answer

What is the best way to handle missing values in data set?

There is no single best way. Each solution or algorithm has its own pros and cons. You can even mix several of them into your own strategy and tune the related parameters until you find one that best suits your data; there is a lot of published research on this topic.

For example, mean imputation is quick and simple, but it underestimates the variance and distorts the shape of the distribution, because every NaN is replaced with the same mean value. KNN imputation, on the other hand, might not be ideal for a large data set in terms of time complexity, since it iterates over the data points and performs a distance calculation for each NaN value. It also assumes that the attribute containing the NaN is correlated with the other attributes.
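The variance effect is easy to see on synthetic data. The sketch below assumes a normally distributed column with about 70% of its values hidden at random; the numbers are arbitrary and not taken from the question's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, fully observed column (arbitrary mean and spread).
full = pd.Series(rng.normal(loc=50.0, scale=5.0, size=1000))

# Hide roughly 70% of the values at random.
hidden = rng.random(1000) < 0.7
observed = full.mask(hidden)

# Mean imputation: every gap gets the mean of the observed values.
imputed = observed.fillna(observed.mean())

# pandas skips NaN when computing the variance of `observed`.
var_observed = observed.var()
var_imputed = imputed.var()

print(f"variance of observed values: {var_observed:.2f}")
print(f"variance after mean fill:    {var_imputed:.2f}")
```

The filled values add nothing to the sum of squared deviations but do increase the sample size, so the variance after filling is always smaller than the variance of the observed values alone.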

How do I handle missing values in a dataset before applying a machine learning algorithm?

In addition to the mean imputation you mention, you could also take a look at K-Nearest Neighbor imputation and regression imputation, and check the Imputer class in scikit-learn for an existing API to use (in later scikit-learn releases, Imputer was replaced by SimpleImputer in the sklearn.impute module).

KNN Imputation

Fill the NaN with the mean of that attribute over the k nearest neighbors of the point, where the neighbors are found using the attributes that are observed.
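One way to try this is scikit-learn's `KNNImputer` (available in `sklearn.impute` since version 0.22). The small matrix below is a made-up example, and `n_neighbors=2` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: rows are samples, columns are attributes.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each NaN is filled with the mean of that column over the 2 nearest
# rows that have the column observed. Distances are computed only on
# coordinates present in both rows (a NaN-aware Euclidean distance).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

In this example the gap in the first row is filled from rows two and three, and the gap in the third row from rows two and four. On thousands of rows the pairwise distance computation becomes the main cost.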

Regression Imputation

A regression model is estimated to predict observed values of a variable based on other variables, and that model is then used to impute values in cases where that variable is missing.
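A minimal sketch of the idea, assuming a single fully observed predictor column and a plain linear model. The values are invented so that the relation is exactly linear; real keypoints would not be this tidy, and the choice of `nose_tip_x` as the predictor is only an illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical values: nose_tip_x is complete, mouth_left_corner_x has gaps.
df = pd.DataFrame({
    "nose_tip_x": [40.0, 44.0, 48.0, 52.0, 56.0],
    "mouth_left_corner_x": [55.0, 59.0, np.nan, 67.0, np.nan],
})

# Fit the model only on rows where the target is observed.
known = df["mouth_left_corner_x"].notna()
model = LinearRegression()
model.fit(df.loc[known, ["nose_tip_x"]], df.loc[known, "mouth_left_corner_x"])

# Predict the target for the rows where it is missing and fill them in.
imputed = df.copy()
imputed.loc[~known, "mouth_left_corner_x"] = model.predict(
    df.loc[~known, ["nose_tip_x"]]
)
```

With several columns containing gaps, the same step can be repeated column by column, using whichever other columns are observed as predictors.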

See the 'Imputation of missing values' section of the scikit-learn documentation. I have also heard of the Orange library for imputation, but haven't had a chance to use it yet.

answered Nov 14 '22 11:11

Paul Lo