I recently asked a question regarding missing values in pandas here and was directed to a GitHub issue. After reading through that page and the missing data documentation, I am wondering why merge and join treat NaNs as a match when "they don't compare equal": np.nan != np.nan.
# merge example
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 'match'], 'col2': [1, 2]})
df2 = pd.DataFrame({'col1': [np.nan, 'no match'], 'col3': [3, 4]})
pd.merge(df, df2, on='col1')

  col1  col2  col3
0  NaN     1     3
# join example with same dataframes from above
df.set_index('col1').join(df2.set_index('col1'))

       col2  col3
col1
NaN       1   3.0
match     2   NaN
However, NaNs in groupby are excluded:
df = pd.DataFrame({'col1': [np.nan, 'match', np.nan], 'col2': [1, 2, 1]})
df.groupby('col1').sum()

       col2
col1
match     2
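(As an aside: on pandas 1.1 or newer, groupby takes a dropna parameter that keeps the NaN keys as their own group, which makes the asymmetry with merge even easier to see. A quick sketch using the same toy frame:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 'match', np.nan], 'col2': [1, 2, 1]})

# dropna=False (pandas >= 1.1) keeps NaN as a group key instead of discarding it
result = df.groupby('col1', dropna=False).sum()
print(result)
```

Here the NaN rows are aggregated into a single NaN-labelled group rather than silently dropped.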
Of course you can dropna() or df[df['col1'].notnull()], but I am curious as to why NaNs are excluded in some pandas operations like groupby and not in others like merge, join, update, and map.
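(For anyone landing here looking for a workaround: a sketch, using the same toy frames as above, is to drop NaN keys on both sides before merging, so the keys behave with the equality semantics you would expect:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 'match'], 'col2': [1, 2]})
df2 = pd.DataFrame({'col1': [np.nan, 'no match'], 'col3': [3, 4]})

# Remove NaN keys on both sides so they cannot be paired up by merge
clean = pd.merge(df.dropna(subset=['col1']),
                 df2.dropna(subset=['col1']),
                 on='col1')
print(clean)
```

With the NaN rows stripped, 'match' and 'no match' share no keys, so the result is empty, which is what NaN != NaN semantics would imply.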
Essentially, as I asked above: why do merge and join match on np.nan when NaNs do not compare equal?
Yeah, this is definitely a bug. See GH22491, which documents exactly your issue, and GH22618, which notes the problem is also observed with None. Based on the discussions, this does not appear to be intended behaviour.
A quick source dive shows that the issue *might* be inside the _factorize_keys function in pandas/core/reshape/merge.py. This function appears to factorise the keys to determine which rows are to be matched with each other.
Specifically, this portion:

    # NA group
    lmask = llab == -1
    lany = lmask.any()
    rmask = rlab == -1
    rany = rmask.any()
    if lany or rany:
        if lany:
            np.putmask(llab, lmask, count)
        if rany:
            np.putmask(rlab, rmask, count)
        count += 1
...seems to be the culprit: NaN keys on both sides are identified as a valid category (with a categorical value equal to count), so they end up matching each other.
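You can see the -1 sentinel for yourself with the public pd.factorize, which the merge internals build on: NaN is given no real code, and the snippet above then rewrites every -1 on both sides to the same fresh category count, which is why the NaN keys line up. A small illustration:

```python
import numpy as np
import pandas as pd

# NaN is not assigned a category; it comes back as the -1 sentinel code
codes, uniques = pd.factorize(np.array([np.nan, 'match'], dtype=object))
print(codes)    # NaN -> -1, 'match' -> 0
print(uniques)  # only 'match' is a real category
```

Because both sides' -1 codes are later promoted to the same new category, the matching step effectively treats NaN == NaN.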
Disclaimer: I am not a pandas dev, and this is only my speculation; so the real issue could be something else. But from first glance, this seems like it.