I have just noticed this:
df[df.condition1 & df.condition2]
df[(df.condition1) & (df.condition2)]
Why does the output of these two lines differ?
I cannot share the exact data but I am gonna try to provide as much detail as I can:
df[df.col1 == False & df.col2.isnull()] # returns 33 rows and the rule `df.col2.isnull()` is not in effect
df[(df.col1 == False) & (df.col2.isnull())] # returns 29 rows and both conditions are applied correctly
Thanks to @jezrael and @ayhan, here is what happened. Let me use the example provided by @jezrael:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': [True, False, False, False],
                   'col2': [4, np.nan, np.nan, 1]})
print(df)
col1 col2
0 True 4.0
1 False NaN
2 False NaN
3 False 1.0
If we take a look at row 3:
col1 col2
3 False 1.0
and the way I wrote the condition:
df.col1 == False & df.col2.isnull() # is equivalent to False == False & False
Because the & operator has higher precedence than ==, the unbracketed expression False == False & False is equivalent to:
False == (False & False)
print(False == (False & False)) # prints True
With brackets:
print((False == False) & False) # prints False
I think it is a bit easier to illustrate this problem with numbers:
print(5 == 5 & 1) # prints False, because 5 & 1 returns 1 and 5==1 returns False
print(5 == (5 & 1)) # prints False, same reason as above
print((5 == 5) & 1) # prints 1, because 5 == 5 returns True, and True & 1 returns 1
So lessons learned: always add brackets!!!
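The same precedence effect can be reproduced on the small DataFrame from above. This is only a sketch: the variable names unbracketed, bracketed and negated are made up for illustration, and the ~ spelling is an alternative that was not in the original answers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [True, False, False, False],
                   'col2': [4, np.nan, np.nan, 1]})

# Without brackets Python parses this as df.col1 == (False & df.col2.isnull()).
# False & <anything> is all False, so the isnull() test has no effect.
unbracketed = df[df.col1 == False & df.col2.isnull()]

# With brackets both conditions are applied.
bracketed = df[(df.col1 == False) & (df.col2.isnull())]

# ~ is a unary operator and binds tighter than &, so no extra brackets are needed.
negated = df[~df.col1 & df.col2.isnull()]

print(len(unbracketed))  # 3 (rows 1, 2 and 3)
print(len(bracketed))    # 2 (rows 1 and 2)
print(len(negated))      # 2 (rows 1 and 2)
```

Row 3 is the one that slips through in the unbracketed version, which matches the extra rows seen in the question.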
I wish I could split the answer points between @jezrael and @ayhan :(
The single bracket will output a Pandas Series, while a double bracket will output a Pandas DataFrame. Square brackets can also be used to access observations (rows) from a DataFrame.
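A small sketch of the bracket forms described above, using a made-up two-column DataFrame (the names single, double, rows and mask are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 3, 3], 'b': [0, 7, 5]})

single = df['a']        # one column label -> Series
double = df[['a']]      # list of column labels -> DataFrame with one column
rows = df[0:2]          # a slice selects rows by position
mask = df[df['a'] > 3]  # a boolean Series selects the rows where it is True

print(type(single).__name__)  # Series
print(type(double).__name__)  # DataFrame
```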
Pandas uses other names for data types than Python, for example: object for textual data. A column in a DataFrame can only have one data type.
The equals() function is used to test whether two Pandas objects contain the same elements. This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.
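To illustrate the NaN handling, here is a minimal sketch comparing equals() with an element-wise == on two made-up Series:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan, 3.0])
s2 = pd.Series([1.0, np.nan, 3.0])

# Element-wise ==: NaN != NaN, so the middle position compares as False.
elementwise_all = bool((s1 == s2).all())

# equals(): NaNs in the same location count as equal.
same = s1.equals(s2)

print(elementwise_all)  # False
print(same)             # True
```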
The DataFrame bool() method returns a boolean value, True or False, reflecting the value of the DataFrame. It only works if the DataFrame holds exactly one value and that value is either True or False; otherwise bool() raises an error.
There is no difference between df[condition1 & condition2] and df[(condition1) & (condition2)]. The difference arises when you write an expression and the operator & takes precedence:
df = pd.DataFrame(np.random.randint(0, 10, size=(5, 3)), columns=list('abc'))
df
Out:
a b c
0 5 0 3
1 3 7 9
2 3 5 2
3 4 7 6
4 8 8 1
condition1 = df['a'] > 3
condition2 = df['b'] < 5
df[condition1 & condition2]
Out:
a b c
0 5 0 3
df[(condition1) & (condition2)]
Out:
a b c
0 5 0 3
However, if you type it like this you'll see an error:
df[df['a'] > 3 & df['b'] < 5]
Traceback (most recent call last):
File "<ipython-input-7-9d4fd21246ca>", line 1, in <module>
df[df['a'] > 3 & df['b'] < 5]
File "/home/ayhan/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py", line 892, in __nonzero__
.format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
This is because 3 & df['b'] is evaluated first (this corresponds to False & df.col2.isnull() in your example). So you need to group the conditions in parentheses:
df[(df['a'] > 3) & (df['b'] < 5)]
Out[8]:
a b c
0 5 0 3
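For anyone wondering why the unbracketed version raises instead of just returning the wrong rows, here is a sketch with the values from the table above typed in by hand (the original used random data). The .gt()/.lt() and query() spellings are alternatives I am adding, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 3, 3, 4, 8],
                   'b': [0, 7, 5, 7, 8],
                   'c': [3, 9, 2, 6, 1]})

# df['a'] > 3 & df['b'] < 5 parses as the chained comparison
# df['a'] > (3 & df['b']) < 5, i.e. (df['a'] > x) and (x < 5).
# The implicit `and` asks for the truth value of a Series, which raises.
try:
    df[df['a'] > 3 & df['b'] < 5]
    raised = False
except ValueError:
    raised = True

# Three spellings that apply both conditions as intended.
with_parens = df[(df['a'] > 3) & (df['b'] < 5)]
with_methods = df[df['a'].gt(3) & df['b'].lt(5)]
with_query = df.query('a > 3 and b < 5')

print(raised)                      # True
print(with_parens.index.tolist())  # [0]
```

The method and query() forms sidestep the precedence question entirely, but plain parentheses remain the simplest fix.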