I have a pandas dataframe whose entries are all strings:
            A       B      C
    1   apple  banana   pear
    2    pear    pear  apple
    3  banana    pear   pear
    4   apple   apple   pear
etc. I want to select all the rows that contain a certain string, say, 'banana'. I don't know which column it will appear in each time. Of course, I can write a for loop and iterate over all rows. But is there an easier or faster way to do this?
Method 1: Using contains()
Use the contains() method of strings to filter the rows. Here the rows are filtered on the 'Credit-Rating' column of the dataframe by converting it to string and then applying the contains method of the string class.
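A minimal sketch of that approach, assuming a small made-up frame (only the 'Credit-Rating' column name comes from the text above; the names and ratings are illustrative). Passing regex=False asks for a literal substring match.

```python
import pandas as pd

# Hypothetical data; only the 'Credit-Rating' column name comes from the text.
credit = pd.DataFrame(
    {
        "Name": ["Ann", "Bob", "Cid", "Dee"],
        "Credit-Rating": ["Fair", "Good", "Fairly good", "Poor"],
    }
)

# Convert the column to string, then flag rows whose rating contains 'Fair'.
mask = credit["Credit-Rating"].astype(str).str.contains("Fair", regex=False)

# Boolean indexing keeps only the flagged rows.
fair_rows = credit[mask]
print(fair_rows)
```

Note that this is a substring test on one known column, so 'Fairly good' is selected as well as 'Fair'.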
Filter Rows by Condition
You can use df[df["Courses"] == 'Spark'] to filter rows by a condition in a pandas DataFrame. Note that this expression returns a new DataFrame with the selected rows.
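As a small illustration of that expression, here is a sketch on made-up data (only the 'Courses' column and the value 'Spark' come from the text; the 'Fee' column is an assumption for the example).

```python
import pandas as pd

# Hypothetical data; 'Fee' is an illustrative extra column.
courses = pd.DataFrame(
    {"Courses": ["Spark", "Pandas", "Spark"], "Fee": [200, 150, 220]}
)

# Exact-equality filter on a single, known column.
spark_rows = courses[courses["Courses"] == "Spark"]

print(spark_rows)
print(len(courses))  # the original frame still has all of its rows
```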
In R, you can select the columns with, for instance, a <- dat[, dat[1, ] == 1]; the only trick is resetting the column names when you end up extracting a single column.
At the heart of selecting rows, we need a 1D mask: a pandas Series of boolean elements with the same length as df. Let's call it mask. Then df[mask] gives us the selected rows of df through boolean indexing.
Here's our starting df:

    In [42]: df
    Out[42]:
            A       B      C
    1   apple  banana   pear
    2    pear    pear  apple
    3  banana    pear   pear
    4   apple   apple   pear
Now, if we need to match just one string, it's straightforward with elementwise equality:

    In [42]: df == 'banana'
    Out[42]:
           A      B      C
    1  False   True  False
    2  False  False  False
    3   True  False  False
    4  False  False  False
If we need to look for ANY one match in each row, use the .any method:

    In [43]: (df == 'banana').any(axis=1)
    Out[43]:
    1     True
    2    False
    3     True
    4    False
    dtype: bool
To select the corresponding rows:

    In [44]: df[(df == 'banana').any(axis=1)]
    Out[44]:
            A       B     C
    1   apple  banana  pear
    3  banana    pear  pear
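The single-string steps above can be collected into a small helper. This is only a sketch: the function name rows_containing is hypothetical, and the frame is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)


def rows_containing(frame, value):
    """Return the rows of `frame` in which any cell equals `value` exactly."""
    # Elementwise equality gives a boolean frame; any(axis=1) reduces it
    # to one boolean per row, which is the 1D mask used for indexing.
    mask = (frame == value).any(axis=1)
    return frame[mask]


banana_rows = rows_containing(df, "banana")
print(banana_rows)
```

Because the comparison is exact equality, a cell such as 'bananas' would not match; a substring search would need str.contains on each column instead.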
1. Search for ANY match
Here's our starting df:

    In [42]: df
    Out[42]:
            A       B      C
    1   apple  banana   pear
    2    pear    pear  apple
    3  banana    pear   pear
    4   apple   apple   pear
NumPy's np.isin would work here (or use pandas' DataFrame.isin, as listed in other posts) to get all matches from the list of search strings in df. So, say we are looking for 'pear' or 'apple' in df:
    In [51]: np.isin(df, ['pear','apple'])
    Out[51]:
    array([[ True, False,  True],
           [ True,  True,  True],
           [False,  True,  True],
           [ True,  True,  True]])

    # ANY match along each row
    In [52]: np.isin(df, ['pear','apple']).any(axis=1)
    Out[52]: array([ True,  True,  True,  True])

    # Select corresponding rows with masking
    In [56]: df[np.isin(df, ['pear','apple']).any(axis=1)]
    Out[56]:
            A       B      C
    1   apple  banana   pear
    2    pear    pear  apple
    3  banana    pear   pear
    4   apple   apple   pear
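For comparison, the same ANY search can be written with pandas alone via DataFrame.isin. A minimal sketch, with the frame rebuilt so it runs standalone; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

# True wherever a cell equals one of the search strings.
hits = df.isin(["pear", "apple"])

# One boolean per row: does the row contain at least one search string?
any_mask = hits.any(axis=1)
any_rows = df[any_mask]

# A narrower search list selects fewer rows.
banana_only = df[df.isin(["banana"]).any(axis=1)]
print(any_rows)
print(banana_only)
```

Unlike the NumPy version, the mask here is a Series that keeps the index of df, which can be convenient when aligning it with other Series.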
2. Search for ALL match
Here's our starting df again:

    In [42]: df
    Out[42]:
            A       B      C
    1   apple  banana   pear
    2    pear    pear  apple
    3  banana    pear   pear
    4   apple   apple   pear
So, now we are looking for rows that have BOTH of, say, ['pear','apple']. We will make use of NumPy broadcasting:
    In [66]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1)
    Out[66]:
    array([[ True,  True],
           [ True,  True],
           [ True, False],
           [ True,  True]])
So, we have a search list of 2 items and hence we have a 2D mask with number of rows = len(df) and number of cols = number of search items. Thus, in the above result, we have the first col for 'pear' and the second one for 'apple'.
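To see where that 2D mask comes from, the sketch below spells out the intermediate shapes. It swaps np.equal.outer for explicit broadcasting with a trailing axis, which is assumed here to be an equivalent way to compare every cell against every search item; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

items = np.array(["pear", "apple"], dtype=object)
cells = df.to_numpy(dtype=object)   # shape (4, 3): rows x columns

# Add a trailing axis so each cell is compared against each search item.
hits = cells[:, :, None] == items   # shape (4, 3, 2): rows x columns x items

# Collapse the column axis: per row, was each item found in any column?
per_row = hits.any(axis=1)          # shape (4, 2): rows x items

print(hits.shape, per_row.shape)
print(per_row)
```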
To make things concrete, let's get a mask for three items ['apple','banana', 'pear']:

    In [62]: np.equal.outer(df.to_numpy(copy=False), ['apple','banana', 'pear']).any(axis=1)
    Out[62]:
    array([[ True,  True,  True],
           [ True, False,  True],
           [False,  True,  True],
           [ True, False,  True]])

The columns of this mask are for 'apple', 'banana' and 'pear' respectively.
Back to the case of 2 search items, we had earlier:

    In [66]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1)
    Out[66]:
    array([[ True,  True],
           [ True,  True],
           [ True, False],
           [ True,  True]])
Since we are looking for ALL matches in each row:

    In [67]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1).all(axis=1)
    Out[67]: array([ True,  True, False,  True])
Finally, select the rows:

    In [70]: df[np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1).all(axis=1)]
    Out[70]:
           A       B      C
    1  apple  banana   pear
    2   pear    pear  apple
    4  apple   apple   pear
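If the broadcasting form feels dense, the same ALL search can be sketched with pandas only, by building one row mask per search item and AND-ing them together. The helper name rows_containing_all is hypothetical, and the sketch assumes a non-empty search list.

```python
import functools
import operator

import pandas as pd

df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)


def rows_containing_all(frame, values):
    """Return rows of `frame` that contain every string in `values`.

    Assumes `values` is non-empty; reduce() raises on an empty list.
    """
    # One boolean Series per search item: does the row contain that item?
    masks = [(frame == value).any(axis=1) for value in values]
    # A row qualifies only if every per-item mask is True for it.
    combined = functools.reduce(operator.and_, masks)
    return frame[combined]


both_rows = rows_containing_all(df, ["pear", "apple"])
print(both_rows)
```

This loops over the search items in Python, so it trades some speed for readability when the search list is long.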