pandas - merging with missing values

Tags: python, merge, pandas, missing-data

There appears to be a quirk with the pandas merge function. It considers NaN values to be equal, and will merge NaNs with other NaNs:

>>> import numpy as np
>>> import pandas as pd

>>> foo = pd.DataFrame([
    ['a',1,2],
    ['b',4,5],
    ['c',7,8],
    [np.nan,10,11]
], columns=['id','x','y'])

>>> bar = pd.DataFrame([
    ['a',3],
    ['c',9],
    [np.nan,12]
], columns=['id','z'])

>>> pd.merge(foo, bar, how='left', on='id')
Out[428]: 
    id   x   y   z
0    a   1   2   3
1    b   4   5 NaN
2    c   7   8   9
3  NaN  10  11  12

[4 rows x 4 columns]

This is unlike any relational database I've seen. Normally missing values are treated as unknown and are not matched to each other as if they were equal. This is especially problematic for datasets with sparse data: every NaN key is matched to every other NaN key, which can result in a huge DataFrame.
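A minimal sketch of that blow-up, using hypothetical frames `left` and `right` (illustrative names and values, not part of the example above). It assumes a pandas version where null keys are matched to each other during a merge, as in the output shown:

```python
import numpy as np
import pandas as pd

# Hypothetical data: three NaN keys on the left, two NaN keys on the right.
left = pd.DataFrame({'id': ['a', np.nan, np.nan, np.nan], 'x': [1, 2, 3, 4]})
right = pd.DataFrame({'id': ['a', np.nan, np.nan], 'z': [10, 20, 30]})

merged = pd.merge(left, right, how='left', on='id')

# One row for 'a', plus 3 * 2 rows from pairing every NaN with every NaN.
print(len(left), len(right), len(merged))
```

With thousands of missing keys on each side, the NaN-to-NaN pairs grow as the product of the two counts.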

Is there a way to ignore missing values during a merge without first slicing them out?

asked May 29 '14 18:05

aensm

2 Answers

You could exclude the rows of bar (and indeed of foo, if you wanted) where id is null at merge time. I'm not sure it's what you're after, though, since those rows are still sliced out.

(I've assumed from your left join that you're interested in retaining all of foo, but only want to merge the parts of bar that match and are not null.)

foo.merge(bar[pd.notnull(bar.id)], how='left', on='id')

Out[11]: 
    id   x   y   z
0    a   1   2   3
1    b   4   5 NaN
2    c   7   8   9
3  NaN  10  11 NaN
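One way to double-check that the NaN row of foo was left unmatched, rather than matched to something unexpected, is the `indicator` option of merge. This is only a sketch on the question's example data; the variable name `checked` is made up for illustration:

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame(
    [['a', 1, 2], ['b', 4, 5], ['c', 7, 8], [np.nan, 10, 11]],
    columns=['id', 'x', 'y'],
)
bar = pd.DataFrame([['a', 3], ['c', 9], [np.nan, 12]], columns=['id', 'z'])

# Filter the null keys out of bar only, then ask merge to label each row.
checked = foo.merge(bar[bar['id'].notna()], how='left', on='id', indicator=True)

# '_merge' is 'both' for matched rows and 'left_only' for unmatched ones.
print(checked[['id', 'z', '_merge']])
```

The row with the NaN id should show up as `left_only`, with z left as NaN.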

answered Oct 13 '22 02:10

meloncholy

If you do not need the NaN rows from either the left or the right DataFrame, use:

pd.merge(foo.dropna(subset=['id']), bar.dropna(subset=['id']), how='left', on='id')

Otherwise, if you need to keep the NaN rows of the left DataFrame, use:

pd.merge(foo, bar.dropna(subset=['id']), how='left', on='id')
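To see how the two variants differ, here is a small sketch run on the question's example frames. The names `both_dropped` and `left_kept` are only for illustration:

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame(
    [['a', 1, 2], ['b', 4, 5], ['c', 7, 8], [np.nan, 10, 11]],
    columns=['id', 'x', 'y'],
)
bar = pd.DataFrame([['a', 3], ['c', 9], [np.nan, 12]], columns=['id', 'z'])

# Variant 1: drop null keys on both sides, so foo's NaN row disappears.
both_dropped = pd.merge(
    foo.dropna(subset=['id']), bar.dropna(subset=['id']), how='left', on='id'
)

# Variant 2: drop null keys on the right only, so foo's NaN row
# survives but is not matched to anything in bar.
left_kept = pd.merge(foo, bar.dropna(subset=['id']), how='left', on='id')

print(len(both_dropped), len(left_kept))
```

Neither call modifies foo or bar, since `dropna` returns a new DataFrame by default.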

answered Oct 13 '22 03:10

Liang
