Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas - merging with missing values

There appears to be a quirk with the pandas merge function. It considers NaN values to be equal, and will merge NaNs with other NaNs:

>>> foo = DataFrame([
    ['a',1,2],
    ['b',4,5],
    ['c',7,8],
    [np.NaN,10,11]
], columns=['id','x','y'])

>>> bar = DataFrame([
    ['a',3],
    ['c',9],
    [np.NaN,12]
], columns=['id','z'])

>>> pd.merge(foo, bar, how='left', on='id')
Out[428]: 
    id   x   y   z
0    a   1   2   3
1    b   4   5 NaN
2    c   7   8   9
3  NaN  10  11  12

[4 rows x 4 columns]

This is unlike any RDB I've seen, normally missing values are treated with agnosticism and won't be merged together as if they are equal. This is especially problematic for datasets with sparse data (every NaN will be merged to every other NaN, resulting in a huge DataFrame!)

Is there a way to ignore missing values during a merge without first slicing them out?

like image 452
aensm Avatar asked May 29 '14 18:05

aensm


People also ask

How do you deal with missing values in pandas?

In order to check missing values in Pandas DataFrame, we use a function isnull() and notnull(). Both function help in checking whether a value is NaN or not. These function can also be used in Pandas Series in order to find null values in a series.

How do I merge pandas without duplicates?

To concatenate DataFrames, use the concat() method, but to ignore duplicates, use the drop_duplicates() method.

What is the difference between merging and concatenation in pandas?

merge() for combining data on common columns or indices. . join() for combining data on a key column or an index. concat() for combining DataFrames across rows or columns.

Is pandas merge efficient?

A merge is also just as efficient as a join as long as: Merging is done on indexes if possible. The “on” parameter is avoided, and instead, both columns to merge on are explicitly stated using the keywords left_on, left_index, right_on, and right_index (when applicable).


2 Answers

You could exclude values from bar (and indeed foo if you wanted) where id is null during the merge. Not sure it's what you're after, though, as they are sliced out.

(I've assumed from your left join that you're interested in retaining all of foo, but only want to merge the parts of bar that match and are not null.)

foo.merge(bar[pd.notnull(bar.id)], how='left', on='id')

Out[11]: 
id   x   y   z
0    a   1   2   3
1    b   4   5 NaN
2    c   7   8   9
3  NaN  10  11 NaN
like image 60
meloncholy Avatar answered Oct 13 '22 02:10

meloncholy


if do not need NaN in both left and right DF, use

pd.merge(foo.dropna(subset=['id']), bar.dropna(subset=['id']), how='left', on='id')

else if need NaN in left DF, use

pd.merge(foo, bar.dropna(subset=['id']), how='left', on='id')
like image 22
Liang Avatar answered Oct 13 '22 03:10

Liang