 

Inconsistent NaN KeyError using Pandas Apply

I'm recoding multiple columns in a dataframe and have come across a strange result that I can't quite figure out. I'm probably not recoding in the most efficient manner possible, but it's mostly the error that I'm hoping someone can explain.

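import numpy as np
import pandas as pd

# s1 holds strings plus np.nan (object dtype); s2 holds numbers plus np.nan (float64 dtype)
s1 = pd.DataFrame([np.nan, '1', '2', '3', '4', '5'], columns=['col1'])
s2 = pd.DataFrame([np.nan, 1, 2, 3, 4, 5], columns=['col1'])
# Recode tables: 4 and 5 collapse into 3, nan stays nan
s1_dic = {np.nan: np.nan, '1': 1, '2': 2, '3': 3, '4': 3, '5': 3}
s2_dic = {np.nan: np.nan, 1: 1, 2: 2, 3: 3, 4: 3, 5: 3}
s1['col1'].apply(lambda x: s1_dic[x])  # works
s2['col1'].apply(lambda x: s2_dic[x])  # raises KeyError: nan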

The s1 version works fine, but when I do the same thing with s2, which holds integers and an np.nan, I get KeyError: nan, which is confusing. Any help would be appreciated.

Brian Huey asked Nov 25 '25 02:11



1 Answer

A workaround is to pass the dict's get method to apply, rather than the lambda:

In [11]: s1['col1'].apply(s1_dic.get)
Out[11]:
0   NaN
1     1
2     2
3     3
4     3
5     3
Name: col1, dtype: float64

In [12]: s2['col1'].apply(s2_dic.get)
Out[12]:
0   NaN
1     1
2     2
3     3
4     3
5     3
Name: col1, dtype: float64

It's not clear to me right now why this is different...


Note: the dicts can be accessed by nan:

In [21]: s1_dic[np.nan]
Out[21]: nan

In [22]: s2_dic[np.nan]
Out[22]: nan

and hash(np.nan) == 0 (on Python versions before 3.10; later versions hash a nan by object identity), so it's not that...


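Update: Apparently the issue is np.nan versus np.float64(np.nan). The expression np.nan is np.nan is True, because np.nan is bound to one specific nan object, whereas float('nan') is not float('nan'), because each call creates a new object. A nan never compares equal to anything, itself included, so a dict can only match a nan key by identity. The values that apply hands to the lambda from the float64 column in s2 are freshly created nan objects rather than np.nan itself, so s2_dic[x] raises KeyError. The object column in s1 presumably still holds the original np.nan object, which is why that lookup succeeds.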
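The difference between identity and equality can be checked with plain Python floats. This is a minimal sketch using only built-ins; the names nan_a and nan_b are illustrative and do not come from the question.

```python
# Two separately created NaN objects (hypothetical names, for illustration).
nan_a = float("nan")
nan_b = float("nan")

# A NaN never compares equal to anything, including itself.
equal_to_itself = nan_a == nan_a       # False

# Identity still holds for one and the same object...
same_object = nan_a is nan_a           # True

# ...but two separately created NaNs are distinct objects.
distinct_objects = nan_a is not nan_b  # True

print(equal_to_itself, same_object, distinct_objects)
```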

This means that get won't find float('nan'), since every float('nan') counts as a distinct key:

In [21]: nans = [float('nan') for _ in range(5)]

In [22]: {f: 1 for f in nans}
Out[22]: {nan: 1, nan: 1, nan: 1, nan: 1, nan: 1}

This means that, although you can actually retrieve the nans from a dict, any such retrieval is implementation specific! In fact, since the dict matches these nans by their id, the entire behavior above may be implementation specific (for instance, if the nans happened to share the same id, as they may do in a REPL/IPython session).
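To make the lookup rule concrete, here is a small sketch with plain Python floats. It assumes CPython's dict behavior, where a stored key matches if it is the same object as the lookup key or compares equal to it. The names key_nan, other_nan and table are hypothetical.

```python
key_nan = float("nan")    # the NaN object stored as a key
other_nan = float("nan")  # a different NaN object

table = {key_nan: "recoded"}

# Same object: found through the identity check.
hit = table.get(key_nan)

# Different NaN object: neither identical nor equal to the stored key,
# so the lookup misses and get falls back to its default of None.
miss = table.get(other_nan)

# Indexing with [] raises instead of returning a default,
# the same kind of KeyError as in the question.
try:
    table[other_nan]
    raised = False
except KeyError:
    raised = True

print(hit, miss, raised)
```

With get, the miss comes back as None rather than an exception, which is consistent with the NaN shown in the first row of the apply output.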

You can catch the nullness beforehand:

In [31]: s2['col1'].apply(lambda x: s2_dic[x] if pd.notnull(x) else x)
Out[31]:
0   NaN
1     1
2     2
3     3
4     3
5     3
Name: col1, dtype: float64

But I think the original suggestion of using .get is a better option.
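As a further alternative, not part of the answer above, Series.map accepts a dict directly and converts values that are missing from the dict into NaN, so the recode table needs no nan key at all. A minimal sketch, where recode and recoded are illustrative names:

```python
import numpy as np
import pandas as pd

s2 = pd.DataFrame([np.nan, 1, 2, 3, 4, 5], columns=["col1"])

# Recode table without a nan key (hypothetical name). Series.map turns
# values that have no matching key, including NaN, into NaN.
recode = {1: 1, 2: 2, 3: 3, 4: 3, 5: 3}

recoded = s2["col1"].map(recode)
print(recoded)
```

Because the NaN row never has to be looked up as a dict key, this sidesteps the identity problem entirely.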

Andy Hayden answered Nov 26 '25 16:11
