Why can't I match a string in a Pandas series using in? In the following example, the first evaluation unexpectedly results in False, but the second one works.
import pandas as pd

df = pd.DataFrame({'name': ['Adam', 'Ben', 'Chris']})
'Adam' in df['name']        # False
'Adam' in list(df['name'])  # True
The pandas isin() method is used to filter DataFrames: it helps select rows that have a particular value (or one of several values) in a particular column. Parameters: values: an iterable, Series, list, tuple, DataFrame, or dictionary to check against the caller Series/DataFrame.
The isin() function checks whether values are contained in a Series. It returns a boolean Series showing whether each element in the Series exactly matches an element in the passed sequence of values.
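As a minimal sketch of the isin() behaviour described above, the snippet below builds the boolean mask and uses it to select rows. It reuses the question's example frame; the names `mask` and `selected` are introduced here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Adam', 'Ben', 'Chris']})

# isin() returns one boolean per row: True where the value
# exactly equals one of the candidates passed in.
mask = df['name'].isin(['Adam', 'Chris'])

# Boolean indexing keeps only the rows where the mask is True.
selected = df[mask]

print(mask.tolist())
print(selected)
```

Calling `.any()` on the mask then answers the yes/no question of whether any row matched.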
Because the in operator is interpreted as a call to df['name'].__contains__('Adam'). If you look at the implementation of __contains__ in pandas.Series, you will find that it's the following (inherited from pandas.core.generic.NDFrame):
def __contains__(self, key):
    """True if the key is in the info axis"""
    return key in self._info_axis
So, your first use of in is interpreted as:
'Adam' in df['name']._info_axis
This gives False, as expected, because df['name']._info_axis actually contains information about the range/index and not the data itself:
In [37]: df['name']._info_axis
Out[37]: RangeIndex(start=0, stop=3, step=1)
In [38]: list(df['name']._info_axis)
Out[38]: [0, 1, 2]
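To make the index-versus-values distinction concrete, here is a small sketch using only the public API rather than the private _info_axis attribute. The Series `ages` and its numbers are made up for illustration; they are not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Adam', 'Ben', 'Chris']})
s = df['name']

# `in` on a Series tests the index labels (0, 1, 2 here), not the values.
label_found = 0 in s
value_found = 'Adam' in s

# Hypothetical Series whose index labels are the names themselves:
# here `in` does find 'Adam', because it is now a label.
ages = pd.Series([31, 25, 40], index=['Adam', 'Ben', 'Chris'])
name_is_label = 'Adam' in ages

print(label_found, value_found, name_is_label)
```

In other words, a Series behaves like a dict in this respect: `in` looks at the keys, not at the stored values.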
'Adam' in list(df['name'])
The use of list converts the pandas.Series to a list of the values. So, the actual operation is this:
In [42]: list(df['name'])
Out[42]: ['Adam', 'Ben', 'Chris']
In [43]: 'Adam' in ['Adam', 'Ben', 'Chris']
Out[43]: True
Here are a few more idiomatic ways to do what you want (with the associated speed):
In [56]: df.name.str.contains('Adam').any()
Out[56]: True
In [57]: timeit df.name.str.contains('Adam').any()
The slowest run took 6.25 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 144 µs per loop
In [58]: df.name.isin(['Adam']).any()
Out[58]: True
In [59]: timeit df.name.isin(['Adam']).any()
The slowest run took 5.13 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 191 µs per loop
In [60]: df.name.eq('Adam').any()
Out[60]: True
In [61]: timeit df.name.eq('Adam').any()
10000 loops, best of 3: 178 µs per loop
Note: the last way is also suggested by @Wen in the comment above
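One caveat worth keeping in mind when choosing among the three: they are not interchangeable for partial strings. The sketch below, which uses a standalone Series named `names` for illustration, shows that str.contains() does substring matching (and treats the pattern as a regular expression by default), while isin() and eq() require an exact match.

```python
import pandas as pd

names = pd.Series(['Adam', 'Ben', 'Chris'])

# str.contains() matches substrings, so 'Ad' is found inside 'Adam'.
substring_hit = bool(names.str.contains('Ad').any())

# isin() and eq() compare whole values, so 'Ad' matches nothing.
isin_hit = bool(names.isin(['Ad']).any())
eq_hit = bool(names.eq('Ad').any())

# regex=False treats the pattern as a literal string, which is safer
# when the search text may contain characters such as '.' or '('.
literal_hit = bool(names.str.contains('Adam', regex=False).any())

print(substring_hit, isin_hit, eq_hit, literal_hit)
```

For an exact membership test like the one in the question, isin() or eq() is therefore the closer equivalent of 'Adam' in list(df['name']).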