I have a data set with a series called Outlet_Size
whose values are one of {'Medium', nan, 'High', 'Small'}.
Around 2566 records are missing, so I decided to fill them with the mode() value and wrote something like this:
train['Outlet_Size'] = train['Outlet_Size'].fillna(train['Outlet_Size'].dropna().mode())
But when I count the remaining missing (NaN) records with
sum(train['Outlet_Size'].isnull())
it still shows 2566 NaN records. Why is that?
Thank you for your answers.
The problem here is that mode returns a Series, and this causes fillna to fail. Consider a simple example:
In [194]:
df = pd.DataFrame({'a':['low','low',np.nan,'medium','medium','medium','medium']})
df
Out[194]:
a
0 low
1 low
2 NaN
3 medium
4 medium
5 medium
6 medium
In [195]:
df['a'].fillna(df['a'].mode())
Out[195]:
0 low
1 low
2 NaN
3 medium
4 medium
5 medium
6 medium
Name: a, dtype: object
You can see that the NaN is not filled above. If we look at what mode returns:
In [196]:
df['a'].mode()
Out[196]:
0 medium
dtype: object
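It's a Series, albeit with a single row. When you pass a Series to fillna, the values are aligned on index, so only a NaN at index label 0 would be filled; here the NaN sits at index 2, so nothing changes. What you want is the scalar value, which you get by indexing into the Series: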
In [197]:
df['a'].fillna(df['a'].mode()[0])
Out[197]:
0 low
1 low
2 medium
3 medium
4 medium
5 medium
6 medium
Name: a, dtype: object
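As a minimal sketch of the two points above, the snippet below first shows the index alignment (only the NaN at label 0 gets filled) and then applies the scalar fix to a small stand-in for the asker's frame. The data here is made up for illustration; only the column name Outlet_Size comes from the question.

```python
import numpy as np
import pandas as pd

# Illustrative data: NaN at index labels 0 and 2
s = pd.Series([np.nan, 'low', np.nan, 'medium', 'medium'])

# Passing the Series returned by mode() aligns on index,
# so only the NaN at label 0 is filled; label 2 stays NaN
aligned = s.fillna(s.mode())
left_after_series_fill = int(aligned.isna().sum())

# Hypothetical stand-in for the asker's `train` DataFrame
train = pd.DataFrame(
    {'Outlet_Size': ['Medium', np.nan, 'High', 'Medium', np.nan, 'Small']}
)

# Take the first (most frequent) value as a scalar
most_common = train['Outlet_Size'].mode()[0]

# fillna returns a new Series, so assign the result back
train['Outlet_Size'] = train['Outlet_Size'].fillna(most_common)

remaining = int(train['Outlet_Size'].isnull().sum())
print(left_after_series_fill, most_common, remaining)
```

If several values tie for the mode, mode() returns all of them, and [0] simply picks the first one.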
EDIT
Regarding whether dropna is required: no, it isn't:
In [204]:
df = pd.DataFrame({'a':['low','low',np.nan,'medium','medium','medium','medium',np.nan,np.nan,np.nan,np.nan]})
df['a'].mode()
Out[204]:
0 medium
dtype: object
You can see that NaN is ignored when computing the mode, even though it is the most frequent entry here.
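For completeness, here is a small sketch of how this behaviour is controlled, assuming a pandas version where Series.mode accepts the dropna parameter (it defaults to True). The Series below is illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative Series in which NaN is the most frequent entry
s = pd.Series(['low', np.nan, np.nan, np.nan])

# Default behaviour: NaN is excluded before counting
default_mode = s.mode()

# With dropna=False, NaN is counted like any other value
mode_with_nan = s.mode(dropna=False)

print(default_mode.tolist())
print(mode_with_nan.isna().tolist())
```

So an explicit dropna() before mode() is redundant unless dropna=False has been passed.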