Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does fillna with median on dataframe still leaves Na/NaN in pandas?

I've seen this and this thread here, but something else is wrong.

I have a very large pandas DataFrame, with many Na/NaN values. I want to replace them with the median value for that feature.

So, I first make a table that displays the Na values per feature, sorted by most Na values, then use fillna(), and then display that table again. Ideally, the second time, that table should have all 0's, because all the Na's have been filled.

nullCount = pd.DataFrame(TT_df.isnull().sum(),columns=["nullcount"]).sort_values(by="nullcount",ascending=False)
display(nullCount.head(10))

TT_df = TT_df.fillna(TT_df.median())

nullCount = pd.DataFrame(TT_df.isnull().sum(),columns=["nullcount"]).sort_values(by="nullcount",ascending=False)
display(nullCount.head(10))

However, I get these two tables:

null count tables, before and after

and if I take a look at the DataFrame, you can see NaN's in it:

display(TT_df[nullCount.index.tolist()[0:5]].head(50))

NaN examples

It seems like a common problem with fillna() is that it returns a copy, unless you use inplace=True (like in the linked threads above), but I'm not doing that: I'm overwriting TT_df, unless I'm misunderstanding something. You can see that the LotFrontage feature actually does disappear from the second table, implying that the fillna() did work for it. So why isn't it working for the others?

What I suspect is the culprit, though I don't know why, is that Na doesn't actually mean Na for these features: if I look at the data description file, it says:

GarageFinish: Interior finish of the garage

   Fin    Finished
   RFn    Rough Finished  
   Unf    Unfinished
   NA No Garage

Okay, that's fine. But it feels like those NA values should either count as Na for both isnull() and fillna(), or not count for either. Why does it appear to be counted by isnull() but not fillna()?

like image 826
GrundleMoof Avatar asked Oct 16 '22 18:10

GrundleMoof


1 Answers

The problem is with this line:

TT_df = TT_df.fillna(TT_df.median())

Your dataframe has strings and you are attempting to calculate medians on strings. This doesn't work.

Here's a minimal example:

import pandas as pd, numpy as np

df = pd.DataFrame({'A': ['A', 'B', np.nan, 'B']})

df = df.fillna(df.median())

print(df)

     A
0    A
1    B
2  NaN
3    B

What you should do is fillna with median only for numeric columns:

for col in df.select_dtypes(include=np.number):
    df[col] = df[col].fillna(df[col].median())
like image 133
jpp Avatar answered Oct 20 '22 22:10

jpp