
Why does pandas merge on NaN?

I recently asked a question regarding missing values in pandas here and was directed to a GitHub issue. After reading through that page and the missing data documentation, I am wondering why merge and join treat NaNs as a match when "they don't compare equal": np.nan != np.nan

# merge example
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 'match'], 'col2': [1, 2]})
df2 = pd.DataFrame({'col1': [np.nan, 'no match'], 'col3': [3, 4]})
pd.merge(df, df2, on='col1')

    col1    col2    col3
0   NaN      1       3

# join example with same dataframes from above
df.set_index('col1').join(df2.set_index('col1'))

      col2  col3
col1        
NaN     1   3.0
match   2   NaN

However, NaNs in groupby are excluded:

df = pd.DataFrame({'col1':[np.nan, 'match', np.nan], 'col2':[1,2,1]})
df.groupby('col1').sum()

       col2
col1    
match   2
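For comparison, here is a minimal sketch of how the groupby behaviour can be switched, assuming pandas 1.1 or later, where groupby accepts a dropna parameter (it did not exist when this question was asked). The variable names default_sum and with_nan_sum are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [np.nan, "match", np.nan], "col2": [1, 2, 1]})

# Default behaviour: rows whose group key is NaN are left out of the result.
default_sum = df.groupby("col1").sum()

# Assumption: pandas >= 1.1. dropna=False keeps NaN as its own group,
# so both NaN rows are summed together.
with_nan_sum = df.groupby("col1", dropna=False).sum()

print(default_sum)
print(with_nan_sum)
```

With dropna=False the two NaN rows end up in a single group, which is closer to how merge and join treat NaN keys.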

Of course, you can use dropna() or df[df['col1'].notnull()], but I am curious as to why NaNs are excluded in some pandas operations like groupby and not in others like merge, join, update, and map.

Essentially, as I asked above, why do merge and join match on np.nan when NaNs do not compare equal?
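The dropna() workaround mentioned above could look roughly like the following. This is only a sketch of one way to keep NaN keys from pairing up, not an officially recommended pattern; the names default_merged and filtered_merged are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [np.nan, "match"], "col2": [1, 2]})
df2 = pd.DataFrame({"col1": [np.nan, "no match"], "col3": [3, 4]})

# Plain merge: the two NaN keys are paired with each other.
default_merged = pd.merge(df, df2, on="col1")

# Drop rows with a missing key on both sides first, so NaN never
# takes part in the match.
filtered_merged = pd.merge(
    df.dropna(subset=["col1"]),
    df2.dropna(subset=["col1"]),
    on="col1",
)

print(default_merged)
print(filtered_merged)
```

Here the filtered merge comes back empty, because 'match' and 'no match' are the only remaining keys and they differ.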

It_is_Chris asked Dec 09 '18 03:12



1 Answer

Yeah, this is definitely a bug. See GH22491, which documents exactly your issue, and GH22618, which notes that the problem is also observed with None. Based on the discussions, this does not appear to be intended behaviour.

A quick source dive shows that the issue *might* be inside the _factorize_keys function in pandas/core/reshape/merge.py. This function appears to factorise the keys to determine what rows are to be matched with each other.

Specifically, this portion

# NA group
lmask = llab == -1
lany = lmask.any()
rmask = rlab == -1
rany = rmask.any()

if lany or rany:
    if lany:
        np.putmask(llab, lmask, count)
    if rany:
        np.putmask(rlab, rmask, count)
    count += 1

...seems to be the culprit. NaN keys are identified as a valid category (with categorical value equal to count).
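To illustrate the effect, here is a simplified sketch that imitates the relabelling step using the public pd.factorize function. It is not the actual pandas internals: factorizing both sides against one concatenated array is an assumption made to keep the example short, and the names left_keys and right_keys are hypothetical.

```python
import numpy as np
import pandas as pd

left_keys = np.array([np.nan, "match"], dtype=object)
right_keys = np.array([np.nan, "no match"], dtype=object)

# Factorize both sides against one shared vocabulary (a simplification).
# By default, missing values receive the sentinel label -1.
labels, uniques = pd.factorize(np.concatenate([left_keys, right_keys]))
llab = labels[: len(left_keys)].copy()
rlab = labels[len(left_keys):].copy()
count = len(uniques)

# Same idea as the quoted "NA group" step: every -1 is rewritten to a
# fresh group id, so NaN on the left and NaN on the right share a label.
np.putmask(llab, llab == -1, count)
np.putmask(rlab, rlab == -1, count)

print(llab, rlab)
```

After the relabelling, the NaN entries on both sides carry the same integer label, so any join performed on the labels pairs them up even though np.nan != np.nan.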

Disclaimer: I am not a pandas dev, and this is only my speculation, so the real issue could be something else. At first glance, though, this seems to be it.

cs95 answered Oct 20 '22 09:10