Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas boolean indexing error with .drop()

I'm trying to move away from using chained assignment in Pandas. While I discovered this behavior trying to update my assignment practices, I don't think it's related to assignment per se. As shown in the examples below, it seems to be the filtering process itself. It seems like a bug... but perhaps I'm missing something?

df = pd.DataFrame( data = {
                         'acol'  : [1.0, 3.00, 11.0, 31.0, 100.0, 314.0],
                         'bcol' :  [1.0, 3.14, 10.1, 31.4,  81.2, 314.1]
                        })

My old method would have been to filter and re-assign, keeping some rows conditionally:

In[20]: df[(df.acol > df.bcol) & (df.acol > 10)]

Out[20]:
    acol    bcol
2   11.0    10.1
4   100.0   81.2

The new, more responsible me is trying to use the .drop() function instead... but the results seem inconsistent :

In[21]: df.drop((df.acol <= df.bcol) | (df.acol <= 10), axis=0)
Out[21]: 
    acol    bcol
2   11.0    10.1
3   31.0    31.4
4   100.0   81.2
5   314.0   314.1

De Morgan's Theorem says those should give the same result (https://en.wikipedia.org/wiki/De_Morgan%27s_laws).

I've checked the docs (http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing), but this behavior doesn't seem to be mentioned. I'm aware that there's some ambiguity in dataframe indexing, but by using the bitwise operators, I should be avoiding that (Pandas boolean DataFrame selection ambiguity).

What's even more concerning is that if I remove the zeroth entry from each list for the input data, I get different results when using .drop() (row 2 disappears)!

I'm using Python 3.6, Pandas 0.20.2

Halp?

like image 319
Dan Parshall Avatar asked Jul 10 '17 03:07

Dan Parshall


1 Answers

You need boolean indexing with df.index and then drop:

print (df.index[((df.acol <= df.bcol) | (df.acol <= 10))])
Int64Index([0, 1, 3, 5], dtype='int64')

df1 = df.drop(df.index[((df.acol <= df.bcol) | (df.acol <= 10))])
print (df1)
    acol  bcol
2   11.0  10.1
4  100.0  81.2

In your code:

df.drop((df.acol <= df.bcol) | (df.acol <= 10))

is boolean mask converted to 0 and 1, so first and second row was removed.

print ((df.acol <= df.bcol) | (df.acol <= 10))
0     True
1     True
2    False
3     True
4    False
5     True
dtype: bool

So in my opinion use drop here is a bit overcomplicated solution, better is use boolean indexing only:

df[(df.acol > df.bcol) & (df.acol > 10)]

or alternative solution with query:

df1 = df.query("acol > bcol & acol > 10")
print (df1)
    acol  bcol
2   11.0  10.1
4  100.0  81.2
like image 68
jezrael Avatar answered Oct 20 '22 21:10

jezrael