
Using in operator with Pandas series [duplicate]

Tags:

pandas

Why can't I match a string in a Pandas series using in? In the following example, the first evaluation results in False unexpectedly, but the second one works.

import pandas as pd

df = pd.DataFrame({'name': ['Adam', 'Ben', 'Chris']})
'Adam' in df['name']        # False
'Adam' in list(df['name'])  # True
asked Mar 20 '18 19:03 by ceiling cat



1 Answer

In the first case:

The in operator is interpreted as a call to df['name'].__contains__('Adam'). If you look at the implementation of __contains__ in pandas.Series, you will find that it is the following (inherited from pandas.core.generic.NDFrame):

def __contains__(self, key):
    """True if the key is in the info axis"""
    return key in self._info_axis

So your first use of in is interpreted as:

'Adam' in df['name']._info_axis 

This gives False, as expected, because df['name']._info_axis holds the index (here a RangeIndex) and not the data itself:

In [37]: df['name']._info_axis 
Out[37]: RangeIndex(start=0, stop=3, step=1)

In [38]: list(df['name']._info_axis) 
Out[38]: [0, 1, 2]
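To make the index-versus-values distinction concrete, here is a minimal sketch. The variable names (s, s2, and the result flags) are illustrative and not part of the original question; the second Series uses the names as index labels purely to show that in then succeeds.

```python
import pandas as pd

# Default RangeIndex: the labels are 0, 1, 2 and the values are the names.
s = pd.Series(['Adam', 'Ben', 'Chris'], name='name')

label_hit = 0 in s         # membership is tested against the index labels
value_miss = 'Adam' in s   # 'Adam' is a value, not a label

# Hypothetical variant: the names are now the index labels.
s2 = pd.Series([1, 2, 3], index=['Adam', 'Ben', 'Chris'])
label_hit2 = 'Adam' in s2

print(label_hit, value_miss, label_hit2)
```

In this respect a Series behaves like a dict, where in tests the keys rather than the values.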

In the second case:

'Adam' in list(df['name'])

Calling list converts the pandas.Series to a list of its values, so the actual operation is this:

In [42]: list(df['name'])
Out[42]: ['Adam', 'Ben', 'Chris']

In [43]: 'Adam' in ['Adam', 'Ben', 'Chris']
Out[43]: True
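If you want to keep using the in operator, the same idea works with any container of the values, not only list(...). The sketch below assumes the same example data; tolist() and to_numpy() are standard Series methods, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Adam', 'Ben', 'Chris']})

# Both expressions test membership against the values, not the index.
in_list = 'Adam' in df['name'].tolist()
in_array = 'Adam' in df['name'].to_numpy()

# A missing name is reported as absent either way.
missing = 'Zoe' in df['name'].tolist()

print(in_list, in_array, missing)
```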

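Here are a few more idiomatic ways to do what you want (with the associated speed). Note that str.contains performs substring (by default regex) matching, while isin and eq test for exact equality: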

In [56]: df.name.str.contains('Adam').any()
Out[56]: True

In [57]: timeit df.name.str.contains('Adam').any()
The slowest run took 6.25 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 144 µs per loop

In [58]: df.name.isin(['Adam']).any()
Out[58]: True

In [59]: timeit df.name.isin(['Adam']).any()
The slowest run took 5.13 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 191 µs per loop

In [60]: df.name.eq('Adam').any()
Out[60]: True

In [61]: timeit df.name.eq('Adam').any()
10000 loops, best of 3: 178 µs per loop

Note: the last way is also suggested by @Wen in the comment above
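The three approaches are not interchangeable when the search string is only part of a value. A small sketch, using the partial name 'Ada' as an assumed test input (it does not appear in the original question); regex=False is passed so the pattern is treated as a literal string:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Adam', 'Ben', 'Chris']})

# eq and isin compare whole values, so a partial name does not match.
exact_eq = bool(df['name'].eq('Ada').any())
exact_isin = bool(df['name'].isin(['Ada']).any())

# str.contains looks for the text anywhere inside each value.
substring = bool(df['name'].str.contains('Ada', regex=False).any())

print(exact_eq, exact_isin, substring)
```

For an exact membership test, eq or isin is therefore the safer choice; str.contains is appropriate when a partial match is what you want.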

answered Nov 13 '22 06:11 by Mohamed Ali JAMAOUI