I have a pandas dataframe whose entries are all strings:
           A       B      C
    1  apple  banana   pear
    2   pear    pear  apple
    3 banana    pear   pear
    4  apple   apple   pear

etc.

I want to select all the rows that contain a certain string, say, 'banana'. I don't know which column it will appear in each time. Of course, I can write a for loop and iterate over all rows. But is there an easier or faster way to do this?
Method 1: Using contains(). Use the string contains() method to filter the rows. Here the rows are filtered on the 'Credit-Rating' column of the dataframe by converting the column to string and then applying the contains method of the string class.
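The contains() approach described above can be sketched as follows. This is a minimal sketch under stated assumptions: only the column name 'Credit-Rating' comes from the text, while the frame `loans`, its 'Customer' column, and all values are hypothetical and invented for illustration.

```python
import pandas as pd

# Hypothetical data: only the column name 'Credit-Rating' comes from the text.
loans = pd.DataFrame(
    {
        "Customer": ["Ann", "Bob", "Cy"],
        "Credit-Rating": ["Fair", "Excellent", "Fair"],
    }
)

# Convert the column to string, then keep rows whose value contains the substring.
# regex=False treats the pattern as a literal string rather than a regular expression.
fair = loans[loans["Credit-Rating"].astype(str).str.contains("Fair", regex=False)]
print(fair)
```

Unlike an equality test, contains() matches substrings, so 'Fair' would also match a value such as 'Fairly good'.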
Filter Rows by Condition: You can use df[df["Courses"] == 'Spark'] to filter rows by a condition in a pandas DataFrame. Note that this expression returns a new DataFrame with the selected rows.
In R, you can select the columns with, for instance, a <- dat[, dat[1, ] == 1]; the only trick is resetting the column names when you end up extracting a single column.
At the heart of selecting rows, we need a 1D mask: a boolean array or pandas Series with the same length as df. Let's call it mask. Then df[mask] returns the selected rows of df via boolean indexing.
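The mask idea can be sketched end to end as follows. This is a minimal sketch that rebuilds the example frame from the question by hand; the construction of `df` and the name `selected` are assumptions added here for illustration, since the original post only shows the printed frame.

```python
import pandas as pd

# Rebuild the example frame from the question (index starts at 1, as shown there).
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

# Elementwise comparison gives a boolean frame; any(axis=1) collapses it
# to one boolean per row, which is the 1D mask described above.
mask = (df == "banana").any(axis=1)

# Boolean indexing keeps only the rows where the mask is True.
selected = df[mask]
print(selected)
```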
Here's our starting df :
    In [42]: df
    Out[42]:
            A       B      C
    1   apple  banana   pear
    2    pear    pear  apple
    3  banana    pear   pear
    4   apple   apple   pear

Now, if we need to match just one string, it's straightforward with elementwise equality :

    In [42]: df == 'banana'
    Out[42]:
           A      B      C
    1  False   True  False
    2  False  False  False
    3   True  False  False
    4  False  False  False

If we need to look for ANY match in each row, use the .any method :

    In [43]: (df == 'banana').any(axis=1)
    Out[43]:
    1     True
    2    False
    3     True
    4    False
    dtype: bool

To select corresponding rows :

    In [44]: df[(df == 'banana').any(axis=1)]
    Out[44]:
            A       B     C
    1   apple  banana  pear
    3  banana    pear  pear

1. Search for ANY match
Here's our starting df :
    In [42]: df
    Out[42]:
            A       B      C
    1   apple  banana   pear
    2    pear    pear  apple
    3  banana    pear   pear
    4   apple   apple   pear

NumPy's np.isin works here (or use pandas' DataFrame.isin, as listed in other posts) to get all matches from the list of search strings in df. So, say we are looking for 'pear' or 'apple' in df :

    In [51]: np.isin(df, ['pear','apple'])
    Out[51]:
    array([[ True, False,  True],
           [ True,  True,  True],
           [False,  True,  True],
           [ True,  True,  True]])

    # ANY match along each row
    In [52]: np.isin(df, ['pear','apple']).any(axis=1)
    Out[52]: array([ True,  True,  True,  True])

    # Select corresponding rows with masking
    In [56]: df[np.isin(df, ['pear','apple']).any(axis=1)]
    Out[56]:
            A       B      C
    1   apple  banana   pear
    2    pear    pear  apple
    3  banana    pear   pear
    4   apple   apple   pear

2. Search for ALL match
Here's our starting df again :
    In [42]: df
    Out[42]:
            A       B      C
    1   apple  banana   pear
    2    pear    pear  apple
    3  banana    pear   pear
    4   apple   apple   pear

So, now we are looking for rows that contain BOTH items of, say, ['pear','apple']. We will make use of NumPy broadcasting :

    In [66]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1)
    Out[66]:
    array([[ True,  True],
           [ True,  True],
           [ True, False],
           [ True,  True]])

So, we have a search list of 2 items and hence a 2D mask with number of rows = len(df) and number of cols = number of search items. Thus, in the above result, the first col is for 'pear' and the second one for 'apple'.
To make things concrete, let's get a mask for three items ['apple','banana', 'pear'] :
    In [62]: np.equal.outer(df.to_numpy(copy=False), ['apple','banana', 'pear']).any(axis=1)
    Out[62]:
    array([[ True,  True,  True],
           [ True, False,  True],
           [False,  True,  True],
           [ True, False,  True]])

The columns of this mask are for 'apple','banana', 'pear' respectively.
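To see where the per-item columns come from, the intermediate shapes can be inspected. The sketch below is an assumption-laden illustration, not the answer's own code: it swaps np.equal.outer for explicit broadcasting with `[:, :, None]`, rebuilds `df` by hand, and uses the names `items`, `cube` and `per_item`, which do not appear in the original.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame by hand (assumed construction).
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

# Search items as an object array, so the comparison runs on Python strings.
items = np.array(["apple", "banana", "pear"], dtype=object)

# Shape (4, 3, 1) compared with shape (3,) broadcasts to (4, 3, 3):
# rows x dataframe columns x search items.
cube = df.to_numpy(dtype=object)[:, :, None] == items

# Collapse the dataframe-column axis: does each row contain each item anywhere?
per_item = cube.any(axis=1)  # shape (4, 3): rows x search items
print(cube.shape, per_item.shape)
print(per_item)
```

Reducing over axis=1 removes the dataframe columns and leaves one column per search item, which is why the mask's columns line up with the search list.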
Back to the case of 2 search items, we had earlier :

    In [66]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1)
    Out[66]:
    array([[ True,  True],
           [ True,  True],
           [ True, False],
           [ True,  True]])

Since we are looking for ALL matches in each row :

    In [67]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1).all(axis=1)
    Out[67]: array([ True,  True, False,  True])

Finally, select rows :

    In [70]: df[np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1).all(axis=1)]
    Out[70]:
           A       B      C
    1  apple  banana   pear
    2   pear    pear  apple
    4  apple   apple   pear
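For comparison, both searches can also be written with pandas alone. This is a sketch rather than the approach used above: it relies on DataFrame.isin for the ANY case and on one per-item row mask for the ALL case, and the names `wanted`, `any_mask` and `all_mask` are chosen here for illustration.

```python
import pandas as pd

# Rebuild the example frame by hand (assumed construction).
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

wanted = ["pear", "apple"]

# ANY: a row qualifies if at least one of its cells is in the search list.
any_mask = df.isin(wanted).any(axis=1)

# ALL: a row qualifies if every search item appears somewhere in it.
# Build one row mask per item, then require all of them to be True.
all_mask = pd.concat(
    [(df == item).any(axis=1) for item in wanted], axis=1
).all(axis=1)

print(df[any_mask])
print(df[all_mask])
```

The pandas version keeps the index on the masks, which can make the result easier to inspect; whether it is faster than the NumPy version on large frames is not something this sketch measures.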