Given this data frame:
In [40]: df = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [3, 3]})
In [41]: df
Out[41]:
   A  B  C
0  1  2  3
1  1  2  3
If I pass a list of strings to [], it will filter columns:
In [42]: df[['A', 'C']]
Out[42]:
   A  C
0  1  3
1  1  3
But if I pass a list of booleans to [], it will filter rows:
In [45]: df[[True, False]]
Out[45]:
   A  B  C
0  1  2  3
Is there a way to think about this difference, rather than "it's just the way it is"?
My understanding is that this copied R's behavior to make migrating R scripts easier; pandas also started with ix, which is now deprecated. There used to be a lot of ways to do slicing, but we have fewer now:
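Roughly, the main ones left are [] (i.e. __getitem__), loc, and iloc (plus at/iat for scalar access). A minimal sketch of each, written as a plain script rather than an IPython session, using the df defined above:

import pandas as pd

df = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [3, 3]})

df[['A', 'C']]      # [] with a list of labels: selects columns
df[[True, False]]   # [] with a list of booleans: masks rows
df.loc[0, 'A']      # loc: label-based, rows then columns
df.iloc[0, 0]       # iloc: position-based, rows then columns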
Personally I like to use __getitem__ for all of those:
In [11]: df[['A', 'C']]
Out[11]:
   A  C
0  1  3
1  1  3
In [12]: df['A']
Out[12]:
0    1
1    1
Name: A, dtype: int64
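For what it's worth, the same [] also accepts a boolean Series, which is how the boolean path usually gets exercised in practice; a small sketch using the same df:

import pandas as pd

df = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [3, 3]})

mask = df['A'] == 1   # a boolean Series aligned to df's index (here all True)
df[mask]              # filters rows, like df[[True, False]] but label-aligned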
The alternative, loc (or iloc), though it has less ambiguity, is too verbose:
In [13]: df.loc[:, ['A', 'B']]
Out[13]:
   A  B
0  1  2
1  1  2
In [14]: df.loc[:, 'A']
Out[14]:
0    1
1    1
Name: A, dtype: int64
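For reference, the positional (iloc) equivalents of the loc calls above; a quick sketch:

import pandas as pd

df = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [3, 3]})

df.iloc[:, [0, 1]]   # all rows, first two columns, same as df.loc[:, ['A', 'B']]
df.iloc[:, 0]        # all rows, first column, same as df.loc[:, 'A']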
It's worth noting that boolean masking is not ambiguous, unless you have an esoteric example where the column names are themselves booleans and the length of the input list matches the length of the DataFrame:
In [21]: df1 = pd.DataFrame({True: [1, 2], False: [3, 4]})
In [22]: df1
Out[22]:
   False  True
0      3      1
1      4      2
In [23]: df1[[True, False]] # boolean slicing (not as column names)
Out[23]:
   False  True
0      3      1
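Even in that esoteric case there are unambiguous ways in: a scalar key to [] is always a column lookup, and iloc avoids labels entirely. A sketch using the df1 above:

import pandas as pd

df1 = pd.DataFrame({True: [1, 2], False: [3, 4]})

df1[True]       # scalar key: the column labeled True, never a mask
df1.iloc[[0]]   # positional row selection, no label ambiguity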
Historically, there was potential ambiguity in ix (as well as performance issues: there are a lot of possible paths to take). So as well as removing ambiguity, the move to loc and iloc also led to faster code (generally use iloc if you can; it's the fastest).
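If you want to check that speed claim on your own data, a rough timeit sketch (exact numbers depend on pandas version and hardware, so treat it as a sanity check rather than a benchmark):

import timeit

import pandas as pd

df = pd.DataFrame({'A': range(10_000)})

# Positional lookup skips label resolution, so iloc is usually quicker.
print('loc :', timeit.timeit(lambda: df.loc[500, 'A'], number=10_000))
print('iloc:', timeit.timeit(lambda: df.iloc[500, 0], number=10_000))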