
Pandas - Handling NaNs in categorical data

I have a DataFrame column that holds categorical data, but some of the values are missing (NaN). I want to fill the missing values using linear interpolation but am not sure how to go about it. I can't drop the NaNs before converting the column to a categorical type, because those are the values I need to fill. Here is a simple example of what I am trying to do.

col1  col2
5     cloudy
3     windy
6     NaN
7     rainy
10    NaN

Say I want to convert col2 to categorical data but retain the NaNs, then fill them using linear interpolation. How do I go about it? Let's say that after converting the column to categorical data it looks like this:

col2
1
2
NaN
3
NaN

Then I can do linear interpolation and get something like this

col2
1
2
3
3
2

How can I achieve this?

asked Jan 26 '17 at 20:01 by Wasswa Samuel



2 Answers

UPDATE:

Is there a way to convert the data back to its original form after interpolation, i.e. instead of 1, 2 or 3 you have cloudy, windy and rainy again?

Solution: I've intentionally added more rows to your original DF:

In [129]: df
Out[129]:
   col1    col2
0     5  cloudy
1     3   windy
2     6     NaN
3     7   rainy
4    10     NaN
5     5  cloudy
6    10     NaN
7     7   rainy

In [130]: df.dtypes
Out[130]:
col1       int64
col2    category
dtype: object

In [131]: df.col2 = (df.col2.cat.codes.replace(-1, np.nan)
     ...:              .interpolate().astype(int).astype('category')
     ...:              .cat.rename_categories(df.col2.cat.categories))
     ...:

In [132]: df
Out[132]:
   col1    col2
0     5  cloudy
1     3   windy
2     6   rainy
3     7   rainy
4    10  cloudy
5     5  cloudy
6    10  cloudy
7     7   rainy
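
For anyone who wants to run this outside an IPython session, here is a minimal, self-contained sketch of the same idea. It is a variation rather than the exact one-liner above: it masks the -1 codes with `where` instead of `replace`, and it maps the codes back with `pd.Categorical.from_codes`, on the assumption that you want it to keep working even when some category never appears after interpolation. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [5, 3, 6, 7, 10, 5, 10, 7],
    "col2": ["cloudy", "windy", np.nan, "rainy", np.nan,
             "cloudy", np.nan, "rainy"],
})
df["col2"] = df["col2"].astype("category")

# Categories are sorted: cloudy -> 0, rainy -> 1, windy -> 2.
categories = df["col2"].cat.categories

# Category codes mark missing values with -1; turn those into NaN
# so that interpolate() treats them as gaps.
codes = df["col2"].cat.codes.astype("float64")
codes = codes.where(codes >= 0)

# Linear interpolation, then truncate to integer codes (1.5 -> 1).
filled = codes.interpolate().astype(int)

# Map the integer codes back to the original labels.
df["col2"] = pd.Categorical.from_codes(filled.to_numpy(), categories=categories)

print(df)
```

Keep in mind that the result depends on the (alphabetical) order of the categories, so the interpolated labels are only as meaningful as that ordering.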

OLD "numerical" answer:

IIUC you can do this:

In [66]: df
Out[66]:
   col1    col2
0     5  cloudy
1     3   windy
2     6     NaN
3     7   rainy
4    10     NaN

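First let's factorize col2 (with na_sentinel=-2 and the +1 offset, labels are numbered from 1 and missing values come out as -1):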

In [67]: df.col2 = pd.factorize(df.col2, na_sentinel=-2)[0] + 1

In [68]: df
Out[68]:
   col1  col2
0     5     1
1     3     2
2     6    -1
3     7     3
4    10    -1

Now we can interpolate it, replacing the -1s with NaNs first (note that astype(int) truncates, so the interpolated 2.5 becomes 2):

In [69]: df.col2.replace(-1, np.nan).interpolate().astype(int)
Out[69]:
0    1
1    2
2    2
3    3
4    3
Name: col2, dtype: int32

The same approach, but converting the interpolated series to category dtype:

In [70]: df.col2.replace(-1, np.nan).interpolate().astype(int).astype('category')
Out[70]:
0    1
1    2
2    2
3    3
4    3
Name: col2, dtype: category
Categories (3, int64): [1, 2, 3]
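
A note for newer pandas versions: as far as I know, the `na_sentinel` argument of `pd.factorize` was deprecated in pandas 1.5 and removed in 2.0, so the call above may raise a TypeError there. The sketch below reproduces the same numbers while relying only on the default behaviour, where missing values get the code -1. The names `numeric` and `interpolated` are just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [5, 3, 6, 7, 10],
    "col2": ["cloudy", "windy", np.nan, "rainy", np.nan],
})

# By default factorize() numbers labels in order of appearance
# (cloudy -> 0, windy -> 1, rainy -> 2) and uses -1 for missing values.
codes, uniques = pd.factorize(df["col2"])

# Shift to 1-based codes as in the answer; the -1 sentinel becomes 0.
numeric = pd.Series(codes, index=df.index, dtype="float64") + 1

# Turn the sentinel into NaN so that interpolate() fills it.
numeric = numeric.where(numeric > 0)

# Linear interpolation, then truncate to integers (2.5 -> 2).
interpolated = numeric.interpolate().astype(int)

print(interpolated.tolist())
```

`uniques` holds the original labels in order of appearance, so `uniques[interpolated - 1]` would give the labels back if you need them.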
answered Oct 17 '22 at 04:10 by MaxU - stop WAR against UA


I know you're asking for linear interpolation, but here is another, simpler way to do this. Converting categories to numbers isn't such a good idea (the codes impose an arbitrary order on labels such as cloudy, windy and rainy), so I suggest this one instead.

You can simply use the interpolate method in pandas with method 'pad', which carries the last valid value forward, like this:

df.interpolate(method='pad')

You can also see the other methods, with examples of how to use them, in the pandas documentation for interpolate.
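
Here is a minimal sketch of the forward-fill idea on the sample data from the question. It uses `Series.ffill()`, which does the same thing as method 'pad'. As far as I know, recent pandas releases deprecate passing 'pad' to `interpolate` in favour of `ffill()`, so treat the exact call as an assumption about your pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [5, 3, 6, 7, 10],
    "col2": ["cloudy", "windy", np.nan, "rainy", np.nan],
})

# Forward fill: each NaN takes the last valid label above it.
# The labels are never converted to numbers.
df["col2"] = df["col2"].ffill()

print(df)
```

A NaN in the very first row would stay NaN, since there is nothing above it to carry forward; `bfill()` can cover that case.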

answered Oct 17 '22 at 04:10 by Fatemeh Rahimi