I have a df like this:
frame = pd.DataFrame({'a' : ['a,b,c', 'a,c,f', 'b,d,f','a,z,c']})
And a list of items:
letters = ['a','c']
My goal is to get all the rows from frame that contain at least the 2 elements in letters.
I came up with this solution:
subframe = frame
for i in letters:
    # filter the running result so every letter has to match
    subframe = subframe[subframe['a'].str.contains(i)]
This gives me what I want, but it might not be the best solution in terms of scalability. Is there any 'vectorised' solution? Thanks
I would build a list of Series, and then apply a vectorized np.all:
import numpy as np
contains = [frame['a'].str.contains(i) for i in letters]
result = frame[np.all(contains, axis=0)]
It gives as expected:
       a
0  a,b,c
1  a,c,f
3  a,z,c
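One caveat with the str.contains approach: it matches substrings (and treats the pattern as a regular expression by default), so it only coincides with "the row holds this item" when every item is a single character. Below is a minimal sketch of the difference. The frame df, the token list and the helper whole_item_pattern are hypothetical names made up for illustration, not part of the question.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical frame: row 0 holds the item 'ab', not the item 'a'.
df = pd.DataFrame({"a": ["ab,c", "a,c", "b,c"]})
tokens = ["a", "c"]

# Plain substring test; regex=False treats each token as a literal string.
substring_masks = [df["a"].str.contains(tok, regex=False) for tok in tokens]
substring_hits = df[np.all(substring_masks, axis=0)].index.tolist()
print(substring_hits)  # [0, 1]  (row 0 matches because 'a' is inside 'ab')


def whole_item_pattern(tok, sep=","):
    """Regex matching tok only as a complete separator-delimited item."""
    sep = re.escape(sep)
    return rf"(?:^|{sep}){re.escape(tok)}(?:{sep}|$)"


# Same stacking of masks, but each mask requires a whole item.
exact_masks = [df["a"].str.contains(whole_item_pattern(tok)) for tok in tokens]
exact_hits = df[np.all(exact_masks, axis=0)].index.tolist()
print(exact_hits)  # [1]
```

For the single-letter data in the question both variants select the same rows, so this only matters if the items can be longer than one character.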
One way is to split the column values into lists using str.split, and check if set(letters) is a subset of the obtained lists:
letters_s = set(letters)
frame[frame.a.str.split(',').map(letters_s.issubset)]

       a
0  a,b,c
1  a,c,f
3  a,z,c
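The subset check assumes every cell is a clean comma-separated string. If the column may hold missing values or stray spaces around the items, mapping issubset directly would raise a TypeError on NaN and miss items such as ' c'. A possible, more defensive variant is sketched here; the function name rows_with_all_tokens and the extra sample rows are assumptions added for illustration.

```python
import pandas as pd


def rows_with_all_tokens(df, column, tokens, sep=","):
    """Return the rows of df whose `column` holds every item in `tokens`."""
    wanted = set(tokens)

    def has_all(cell):
        # Missing values (None/NaN) cannot contain any item.
        if not isinstance(cell, str):
            return False
        # Strip whitespace so 'a, c' is read as the items 'a' and 'c'.
        return wanted.issubset(part.strip() for part in cell.split(sep))

    mask = df[column].map(has_all).astype(bool)
    return df[mask]


# Hypothetical sample: the question's data plus a spaced row and a missing value.
sample = pd.DataFrame({"a": ["a,b,c", "a, c, f", "b,d,f", None, "a,z,c"]})
kept = rows_with_all_tokens(sample, "a", ["a", "c"])
print(kept.index.tolist())  # [0, 1, 4]
```

This is still a Python-level loop over the cells, like the map call above, so it is not expected to be faster; it just tolerates messier input.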
Benchmark:
import numpy as np
import pandas as pd
import perfplot

# frame and letters are the objects defined in the question

def serge(frame):
    contains = [frame['a'].str.contains(i) for i in letters]
    return frame[np.all(contains, axis=0)]

def yatu(frame):
    letters_s = set(letters)
    return frame[frame.a.str.split(',').map(letters_s.issubset)]

def austin(frame):
    mask = frame.a.apply(lambda x: np.intersect1d(x.split(','), letters).size > 0)
    return frame[mask]

def datanovice(frame):
    s = frame['a'].str.split(',').explode().isin(letters).groupby(level=0).cumsum()
    return frame.loc[s[s.ge(2)].index.unique()]

perfplot.show(
    # repeat the sample frame n times to grow the input
    setup=lambda n: pd.concat([frame]*n, axis=0).reset_index(drop=True),
    kernels=[
        lambda df: serge(df),
        lambda df: yatu(df),
        lambda df: df[df['a'].apply(lambda x: np.all([*map(lambda l: l in x, letters)]))],
        lambda df: austin(df),
        lambda df: datanovice(df),
    ],
    labels=['serge', 'yatu', 'bruno', 'austin', 'datanovice'],
    n_range=[2**k for k in range(0, 18)],
    equality_check=lambda x, y: x.equals(y),
    xlabel='N'
)
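A note on the equality check in the benchmark: the austin kernel keeps a row when its intersection with letters is non-empty, which is an "any letter" test instead of an "all letters" test. On the question's sample the two agree only because no row contains exactly one of the letters. A small sketch with a hypothetical frame (tricky is an invented name) makes the difference visible:

```python
import numpy as np
import pandas as pd

letters = ["a", "c"]
# Hypothetical frame: row 1 contains 'a' but not 'c'.
tricky = pd.DataFrame({"a": ["a,b,c", "a,b,d", "b,d,f"]})

# "All letters" semantics, as in the set-subset kernel.
all_mask = tricky["a"].str.split(",").map(set(letters).issubset)

# "Any letter" semantics, as in the non-empty intersection kernel.
any_mask = tricky["a"].apply(
    lambda x: np.intersect1d(x.split(","), letters).size > 0
)

rows_all = tricky[all_mask].index.tolist()
rows_any = tricky[any_mask].index.tolist()
print(rows_all)  # [0]
print(rows_any)  # [0, 1]
```

If the benchmark input included a row like 'a,b,d', the equality check would presumably reject that kernel, so its timing is best read as measuring a slightly different task.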