I recently asked a question regarding missing values in pandas here and was directed to a GitHub issue. After reading through that page and the missing data documentation, I am wondering why merge and join treat NaNs as a match when "they don't compare equal": np.nan != np.nan.
# merge example
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 'match'], 'col2': [1, 2]})
df2 = pd.DataFrame({'col1': [np.nan, 'no match'], 'col3': [3, 4]})
pd.merge(df, df2, on='col1')

  col1  col2  col3
0  NaN     1     3
# join example with same dataframes from above
df.set_index('col1').join(df2.set_index('col1'))

       col2  col3
col1
NaN       1   3.0
match     2   NaN
However, NaNs in groupby are excluded:
df = pd.DataFrame({'col1': [np.nan, 'match', np.nan], 'col2': [1, 2, 1]})
df.groupby('col1').sum()

       col2
col1
match     2
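(As an aside: on pandas 1.1 or newer, groupby takes a dropna parameter that keeps the NaN keys as their own group, which makes the asymmetry with merge even easier to see. A quick sketch using the same toy frame:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 'match', np.nan], 'col2': [1, 2, 1]})

# dropna=False (pandas >= 1.1) keeps NaN as a group key instead of discarding it
result = df.groupby('col1', dropna=False).sum()
print(result)
```

Here the NaN rows are aggregated into a single NaN-labelled group rather than silently dropped.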
Of course you can dropna() or df[df['col1'].notnull()], but I am curious as to why NaNs are excluded in some pandas operations like groupby and not in others like merge, join, update, and map.
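(For anyone landing here looking for a workaround: a sketch, using the same toy frames as above, is to drop NaN keys on both sides before merging, so the keys behave with the equality semantics you would expect:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.nan, 'match'], 'col2': [1, 2]})
df2 = pd.DataFrame({'col1': [np.nan, 'no match'], 'col3': [3, 4]})

# Remove NaN keys on both sides so they cannot be paired up by merge
clean = pd.merge(df.dropna(subset=['col1']),
                 df2.dropna(subset=['col1']),
                 on='col1')
print(clean)
```

With the NaN rows stripped, 'match' and 'no match' share no keys, so the result is empty, which is what NaN != NaN semantics would imply.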
Essentially, as I asked above: why do merge and join match on np.nan when NaNs do not compare equal?
Yeah, this is definitely a bug. See GH22491, which documents exactly your issue, and GH22618, which notes the problem is also observed with None. Based on the discussions, this does not appear to be intended behaviour.
A quick source dive shows that the issue *might* be inside the _factorize_keys function in pandas/core/reshape/merge.py. This function appears to factorise the keys to determine which rows are to be matched with each other.
Specifically, this portion:

    # NA group
    lmask = llab == -1
    lany = lmask.any()
    rmask = rlab == -1
    rany = rmask.any()
    if lany or rany:
        if lany:
            np.putmask(llab, lmask, count)
        if rany:
            np.putmask(rlab, rmask, count)
        count += 1
...seems to be the culprit: NaN keys on both sides are identified as a valid category (with a categorical value equal to count), so they end up matching each other.
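You can see the -1 sentinel for yourself with the public pd.factorize, which the merge internals build on: NaN is given no real code, and the snippet above then rewrites every -1 on both sides to the same fresh category count, which is why the NaN keys line up. A small illustration:

```python
import numpy as np
import pandas as pd

# NaN is not assigned a category; it comes back as the -1 sentinel code
codes, uniques = pd.factorize(np.array([np.nan, 'match'], dtype=object))
print(codes)    # NaN -> -1, 'match' -> 0
print(uniques)  # only 'match' is a real category
```

Because both sides' -1 codes are later promoted to the same new category, the matching step effectively treats NaN == NaN.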
Disclaimer: I am not a pandas dev, and this is only my speculation; so the real issue could be something else. But from first glance, this seems like it.