
Greater/less than comparisons between Pandas DataFrames/Series

Tags: python, pandas

How can I perform comparisons between DataFrames and Series? I'd like to mask elements in a DataFrame/Series that are greater/less than elements in another DataFrame/Series.

For instance, the following doesn't replace elements greater than the row mean with NaN, although I was expecting it to:

>>> x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})
>>> x[x > x.mean(axis=1)] = np.nan
>>> x
   a  b
0  1  3
1  2  4

If we look at the boolean array created by the comparison, it is really weird:

>>> x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})
>>> x > x.mean(axis=1)
       a      b      0      1
0  False  False  False  False
1  False  False  False  False

I don't understand by what logic the resulting boolean array is like that. I'm able to work around this problem by using transpose:

>>> (x.T > x.mean(axis=1).T).T
       a     b
0  False  True
1  False  True

But I believe there is some "correct" way of doing this that I'm not aware of. And at least I'd like to understand what is going on.

Jaakko Luttinen asked Oct 31 '22 16:10


1 Answer

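The problem is that the comparison aligns the Series' index labels (0 and 1) with the DataFrame's column labels (a and b) rather than with its index. None of the labels match, so every comparison comes out False. If you use .gt and pass axis=0, the Series is aligned with the DataFrame's index and you get the result you want: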

In [203]:
x.gt(x.mean(axis=1), axis=0)

Out[203]:
       a     b
0  False  True
1  False  True
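
To get from this boolean frame to the NaN replacement asked about in the question, one option is DataFrame.mask, which by default replaces entries where the condition is True with NaN. This is a minimal sketch rather than the only way to do it; the variable names above_row_mean and masked are made up for illustration:

```python
import pandas as pd

x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})

# axis=0 aligns the Series of row means with the DataFrame's index,
# so each row is compared against its own mean.
above_row_mean = x.gt(x.mean(axis=1), axis=0)

# mask() returns a new frame with NaN wherever the condition is True.
masked = x.mask(above_row_mean)
print(masked)
```

Here column b exceeds the row mean in both rows, so both of its entries become NaN, while column a is left unchanged.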

You can see the effect of that alignment by comparing against the underlying NumPy array, which carries no labels:

In [205]:
x > x.mean(axis=1).values

Out[205]:
       a      b
0  False  False
1  False   True

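Here you can see that, by default, the comparison runs along the columns: the array's values are matched to the columns by position (2.0 against a, 3.0 against b) instead of one mean per row, which gives a different result.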
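
The difference between the two orientations can be sketched with plain NumPy broadcasting. The snippet below is only an illustration under the same toy data; row_means, by_column and by_row are hypothetical names. It avoids the label-based `x > x.mean(axis=1)` expression, which, as far as I know, newer pandas versions reject with an error instead of returning an all-False frame:

```python
import pandas as pd

x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})
row_means = x.mean(axis=1)  # Series with index [0, 1] and values [2.0, 3.0]

# A 1-D array is matched to the columns by position:
# 2.0 is compared with column 'a', 3.0 with column 'b'.
by_column = x > row_means.to_numpy()

# Reshaping to a column vector of shape (2, 1) makes NumPy broadcast
# one mean per row, which is what .gt(..., axis=0) does via labels.
by_row = pd.DataFrame(
    x.to_numpy() > row_means.to_numpy()[:, None],
    index=x.index,
    columns=x.columns,
)

print(by_column)
print(by_row)
```

The label-aware .gt(..., axis=0) form is usually preferable, since it keeps working when the index is not a simple 0..n-1 range or the Series is in a different order.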

EdChum answered Nov 14 '22 01:11