I'm trying to move away from using chained assignment in Pandas. While I discovered this behavior trying to update my assignment practices, I don't think it's related to assignment per se. As shown in the examples below, it seems to be the filtering process itself. It seems like a bug... but perhaps I'm missing something?
df = pd.DataFrame( data = {
'acol' : [1.0, 3.00, 11.0, 31.0, 100.0, 314.0],
'bcol' : [1.0, 3.14, 10.1, 31.4, 81.2, 314.1]
})
My old method would have been to filter and re-assign, keeping some rows conditionally:
In[20]: df[(df.acol > df.bcol) & (df.acol > 10)]
Out[20]:
acol bcol
2 11.0 10.1
4 100.0 81.2
The new, more responsible me is trying to use the .drop()
function instead... but the results seem inconsistent :
In[21]: df.drop((df.acol <= df.bcol) | (df.acol <= 10), axis=0)
Out[21]:
acol bcol
2 11.0 10.1
3 31.0 31.4
4 100.0 81.2
5 314.0 314.1
De Morgan's Theorem says those should give the same result (https://en.wikipedia.org/wiki/De_Morgan%27s_laws).
I've checked the docs (http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing), but this behavior doesn't seem to be mentioned. I'm aware that there's some ambiguity in dataframe indexing, but by using the bitwise operators, I should be avoiding that (Pandas boolean DataFrame selection ambiguity).
What's even more concerning is that if I remove the zeroth entry from each list for the input data, I get different results when using .drop()
(row 2 disappears)!
I'm using Python 3.6, Pandas 0.20.2
Halp?
You need boolean indexing
with df.index
and then drop
:
print (df.index[((df.acol <= df.bcol) | (df.acol <= 10))])
Int64Index([0, 1, 3, 5], dtype='int64')
df1 = df.drop(df.index[((df.acol <= df.bcol) | (df.acol <= 10))])
print (df1)
acol bcol
2 11.0 10.1
4 100.0 81.2
In your code:
df.drop((df.acol <= df.bcol) | (df.acol <= 10))
is boolean mask converted to 0
and 1
, so first and second row was removed.
print ((df.acol <= df.bcol) | (df.acol <= 10))
0 True
1 True
2 False
3 True
4 False
5 True
dtype: bool
So in my opinion use drop
here is a bit overcomplicated solution, better is use boolean indexing
only:
df[(df.acol > df.bcol) & (df.acol > 10)]
or alternative solution with query
:
df1 = df.query("acol > bcol & acol > 10")
print (df1)
acol bcol
2 11.0 10.1
4 100.0 81.2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With