Check if pandas column contains all elements from a list

I have a df like this:

frame = pd.DataFrame({'a' : ['a,b,c', 'a,c,f', 'b,d,f','a,z,c']})

And a list of items:

letters = ['a','c']

My goal is to get all the rows from frame that contain every element of letters (here, both 'a' and 'c').

I came up with this solution:

for i in letters:
    subframe = frame[frame['a'].str.contains(i)]

This gives me what I want, but it might not be the best solution in terms of scalability. Is there any 'vectorised' solution? Thanks

asked Mar 30 '20 13:03

AVal

2 Answers

I would build a list of Series, and then apply a vectorized np.all:

contains = [frame['a'].str.contains(i) for i in letters]
result = frame[np.all(contains, axis=0)]

It gives the expected result:

       a
0  a,b,c
1  a,c,f
3  a,z,c
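As a self-contained illustration of the approach above, here is a minimal sketch that also shows one caveat: str.contains does substring matching (and treats the pattern as a regular expression by default), so it can match more than whole comma-separated items. The tricky frame and the regex=False argument are additions for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': ['a,b,c', 'a,c,f', 'b,d,f', 'a,z,c']})
letters = ['a', 'c']

# One boolean Series per letter; regex=False treats each letter as a literal substring.
contains = [frame['a'].str.contains(letter, regex=False) for letter in letters]

# np.all over axis 0 keeps only the rows where every per-letter mask is True.
mask = np.all(contains, axis=0)
result = frame[mask]
print(result)

# Hypothetical data: 'ab,c' has no standalone 'a' item, yet substring matching accepts it.
tricky = pd.DataFrame({'a': ['ab,c', 'a,c']})
tricky_mask = np.all(
    [tricky['a'].str.contains(letter, regex=False) for letter in letters], axis=0
)
print(tricky[tricky_mask])
```

If the items can be longer than one character, splitting on the separator and comparing whole items is likely the safer choice.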

answered Oct 03 '22 20:10

Serge Ballesta

One way is to split the column values into lists using str.split, and check if set(letters) is a subset of the obtained lists:

letters_s = set(letters)
frame[frame.a.str.split(',').map(letters_s.issubset)]

       a
0  a,b,c
1  a,c,f
3  a,z,c
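The same idea can be wrapped in a small reusable function. This is only a sketch: the name rows_with_all_tokens and its parameters are hypothetical, and it assumes the column has no missing values (a NaN would make issubset raise a TypeError).

```python
import pandas as pd

def rows_with_all_tokens(df, column, required, sep=','):
    """Hypothetical helper: keep rows whose separated items include every required item."""
    required_set = set(required)
    # str.split yields a list of items per row; set.issubset accepts any iterable.
    mask = df[column].str.split(sep).map(required_set.issubset)
    return df[mask]

frame = pd.DataFrame({'a': ['a,b,c', 'a,c,f', 'b,d,f', 'a,z,c']})

both = rows_with_all_tokens(frame, 'a', ['a', 'c'])
print(both)

# Requiring a third item narrows the selection further.
three = rows_with_all_tokens(frame, 'a', ['a', 'c', 'f'])
print(three)
```

Because whole items are compared, a row such as 'ab,c' would not be selected when 'a' is required.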

Benchmark:

import numpy as np
import pandas as pd
import perfplot

# frame and letters are the objects defined in the question

def serge(frame):
    contains = [frame['a'].str.contains(i) for i in letters]
    return frame[np.all(contains, axis=0)]

def yatu(frame):
    letters_s = set(letters)
    return frame[frame.a.str.split(',').map(letters_s.issubset)]

def austin(frame):
    # size > 0 is True when a row shares at least one item with letters
    mask = frame.a.apply(lambda x: np.intersect1d(x.split(','), letters).size > 0)
    return frame[mask]

def datanovice(frame):
    s = frame['a'].str.split(',').explode().isin(letters).groupby(level=0).cumsum()
    return frame.loc[s[s.ge(2)].index.unique()]

perfplot.show(
    setup=lambda n: pd.concat([frame]*n, axis=0).reset_index(drop=True),
    kernels=[
        lambda df: serge(df),
        lambda df: yatu(df),
        lambda df: df[df['a'].apply(lambda x: np.all([*map(lambda l: l in x, letters)]))],
        lambda df: austin(df),
        lambda df: datanovice(df),
    ],
    labels=['serge', 'yatu', 'bruno', 'austin', 'datanovice'],
    n_range=[2**k for k in range(0, 18)],
    equality_check=lambda x, y: x.equals(y),
    xlabel='N'
)
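If perfplot is not available, a rough comparison can be made with the standard library's timeit. The sketch below times only two of the functions from the benchmark above; the size n and the repeat counts are arbitrary choices for illustration, and the numbers it prints will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

letters = ['a', 'c']
base = pd.DataFrame({'a': ['a,b,c', 'a,c,f', 'b,d,f', 'a,z,c']})

def serge(frame):
    contains = [frame['a'].str.contains(i) for i in letters]
    return frame[np.all(contains, axis=0)]

def yatu(frame):
    letters_s = set(letters)
    return frame[frame.a.str.split(',').map(letters_s.issubset)]

# Repeat the sample frame n times, mirroring the setup used in the benchmark above.
n = 1_000  # arbitrary size chosen for illustration
big = pd.concat([base] * n, axis=0).reset_index(drop=True)

# Confirm both functions select the same rows before comparing speed.
assert serge(big).equals(yatu(big))

# Best of 3 repeats, each repeat calling the function 5 times.
timings = {
    name: min(timeit.repeat(lambda f=func: f(big), number=5, repeat=3))
    for name, func in [('serge', serge), ('yatu', yatu)]
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 5 runs")
```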

[Figure: benchmark plot of the approaches against N, https://i.stack.imgur.com/36ObP.png]

answered Oct 03 '22 20:10

yatu
