Check if pandas column contains all elements from a list

I have a df like this:

frame = pd.DataFrame({'a' : ['a,b,c', 'a,c,f', 'b,d,f','a,z,c']}) 

And a list of items:

letters = ['a','c'] 

My goal is to get all the rows from frame whose column a contains every element of letters (here, both 'a' and 'c').

I came up with this solution:

for i in letters:
    subframe = frame[frame['a'].str.contains(i)]

This gives me what I want, but it might not be the best solution in terms of scalability. Is there any 'vectorised' solution? Thanks

AVal asked Mar 30 '20 13:03



2 Answers

I would build a list of Series, and then apply a vectorized np.all:

# one boolean Series per letter, then keep the rows where all of them are True
contains = [frame['a'].str.contains(i) for i in letters]
result = frame[np.all(contains, axis=0)]

It gives as expected:

       a
0  a,b,c
1  a,c,f
3  a,z,c
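
One caveat worth keeping in mind: str.contains tests for a substring (and treats the pattern as a regex by default), so the approach above is only safe while every item is a single character. Here is a minimal sketch of how a substring test and a whole-token test can disagree. The data with the two-character token 'ab' is hypothetical and not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one token is longer than a single character.
frame = pd.DataFrame({'a': ['ab,c', 'a,c,f', 'b,d,f']})
letters = ['a', 'c']

# Substring test: 'a' is found inside the token 'ab', so row 0 matches.
substring_masks = [frame['a'].str.contains(i, regex=False) for i in letters]
substring_result = frame[np.all(substring_masks, axis=0)]

# Whole-token test: split on commas first, then compare complete tokens.
tokens = frame['a'].str.split(',')
token_mask = tokens.map(lambda parts: all(i in parts for i in letters))
token_result = frame[token_mask]

print(substring_result.index.tolist())  # [0, 1]
print(token_result.index.tolist())      # [1]
```

If the items can be longer than one character, splitting before comparing is probably the safer choice.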
Serge Ballesta answered Oct 03 '22 20:10


One way is to split the column values into lists using str.split, and check if set(letters) is a subset of the obtained lists:

letters_s = set(letters)
frame[frame.a.str.split(',').map(letters_s.issubset)]

       a
0  a,b,c
1  a,c,f
3  a,z,c
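
A possible alternative that avoids the Python-level map, sketched here as an assumption rather than as part of the benchmarked answers, is to expand the column into indicator columns with str.get_dummies and require all the wanted columns to be set. The reindex step is my own addition so that a letter missing from every row gives an empty result instead of a KeyError.

```python
import pandas as pd

frame = pd.DataFrame({'a': ['a,b,c', 'a,c,f', 'b,d,f', 'a,z,c']})
letters = ['a', 'c']

# One 0/1 indicator column per distinct token found in the column.
dummies = frame['a'].str.get_dummies(sep=',')

# Keep only the wanted columns; a letter that never occurs becomes all zeros.
wanted = dummies.reindex(columns=letters, fill_value=0)

# A row qualifies when every wanted indicator is non-zero.
result = frame[wanted.all(axis=1)]
print(result)
```

This builds one column per distinct token, so it may use a lot of memory when the column holds many different values.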

Benchmark:

import numpy as np
import pandas as pd
import perfplot

# frame and letters are the objects defined in the question

def serge(frame):
    contains = [frame['a'].str.contains(i) for i in letters]
    return frame[np.all(contains, axis=0)]

def yatu(frame):
    letters_s = set(letters)
    return frame[frame.a.str.split(',').map(letters_s.issubset)]

def austin(frame):
    mask = frame.a.apply(lambda x: np.intersect1d(x.split(','), letters).size > 0)
    return frame[mask]

def datanovice(frame):
    s = frame['a'].str.split(',').explode().isin(letters).groupby(level=0).cumsum()
    return frame.loc[s[s.ge(2)].index.unique()]

perfplot.show(
    # repeat the sample frame n times to grow the input
    setup=lambda n: pd.concat([frame]*n, axis=0).reset_index(drop=True),
    kernels=[
        lambda df: serge(df),
        lambda df: yatu(df),
        lambda df: df[df['a'].apply(lambda x: np.all([*map(lambda l: l in x, letters)]))],
        lambda df: austin(df),
        lambda df: datanovice(df),
    ],
    labels=['serge', 'yatu', 'bruno', 'austin', 'datanovice'],
    n_range=[2**k for k in range(0, 18)],
    equality_check=lambda x, y: x.equals(y),
    xlabel='N'
)

[Figure: perfplot benchmark, runtime against N for serge, yatu, bruno, austin and datanovice]
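
A note on reading the benchmark: the austin kernel tests intersect1d(...).size > 0, which is true when at least one letter is present, while the question asks for all of them. The equality check still passes on the sample frame because no row there holds exactly one of the two letters. A small sketch of the difference follows, using a hypothetical row 'a,b' that is not in the question. The helper names any_match and all_match are mine.

```python
import numpy as np
import pandas as pd

letters = ['a', 'c']
# Hypothetical data: the row 'a,b' contains only one of the two letters.
frame = pd.DataFrame({'a': ['a,b,c', 'a,b', 'b,d,f']})

def any_match(x):
    # True when at least one letter is present (the `.size > 0` test).
    return np.intersect1d(x.split(','), letters).size > 0

def all_match(x):
    # True only when every distinct letter is present.
    return np.intersect1d(x.split(','), letters).size == len(set(letters))

any_rows = frame[frame['a'].apply(any_match)].index.tolist()
all_rows = frame[frame['a'].apply(all_match)].index.tolist()

print(any_rows)  # [0, 1]
print(all_rows)  # [0]
```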

yatu answered Oct 03 '22 20:10