I have a pandas dataframe whose entries are all strings:
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear

etc. I want to select all the rows that contain a certain string, say, 'banana'. I don't know which column it will appear in each time. Of course, I can write a for loop and iterate over all rows. But is there an easier or faster way to do this?
Method 1: Using contains()

Use the string contains() method (Series.str.contains in pandas) to filter the rows. Here the rows are filtered on the 'Credit-Rating' column of the dataframe, by converting that column to string and then calling contains on it.
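As a minimal sketch of the contains() approach: only the 'Credit-Rating' column name comes from the text above. The other column, the sample values, and the search substring "Fair" are made up for illustration.

```python
import pandas as pd

# Hypothetical data; only the 'Credit-Rating' column name comes from the text.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee"],
    "Credit-Rating": ["Fair", "Good", "Fair-Plus", "Poor"],
})

# Convert to str first so non-string values do not break .str.contains,
# then keep the rows whose rating contains the substring "Fair".
# regex=False treats the pattern as a literal string, not a regular expression.
mask = df["Credit-Rating"].astype(str).str.contains("Fair", regex=False)
fair_rows = df[mask]
print(fair_rows)
```

Note that contains() matches substrings, so "Fair-Plus" is selected as well. Use == instead if only exact matches are wanted.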
Filter Rows by Condition

You can use df[df["Courses"] == 'Spark'] to filter rows by a condition in a pandas DataFrame. Note that this expression returns a new DataFrame with the selected rows.
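A small sketch of that expression in context: the "Courses" column and the value 'Spark' come from the text, while the "Fee" column and the remaining values are invented for illustration.

```python
import pandas as pd

# Hypothetical data; only "Courses" and 'Spark' come from the text.
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Spark", "Pandas"],
    "Fee": [20000, 25000, 22000, 30000],
})

# The comparison yields a boolean Series; indexing with it keeps the True rows.
spark_rows = df[df["Courses"] == "Spark"]
print(spark_rows)

# The original frame is left untouched.
print(len(df))
```

This only works when the column to test is known in advance, which is not the case in the question above.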
You can just select the columns with, for instance, a<-dat[,(dat[1,]) == 1] ; the only trick is re-setting the column names when you end up extracting a single column.
At the heart of selecting rows, we need a 1D mask: a NumPy array or pandas Series of booleans with the same length as df. Let's call it mask. Then df[mask] gives us the selected rows of df through boolean indexing.
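The mask idea can be sketched as a self-contained script. The frame is rebuilt here from the sample data in the question; the variable names mask and selected are only illustrative.

```python
import pandas as pd

# Sample frame from the question, with the index running from 1 to 4.
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

# Elementwise comparison gives a boolean frame; any(axis=1) reduces it
# to one boolean per row, i.e. a 1D mask as long as df.
mask = (df == "banana").any(axis=1)

# Boolean indexing keeps the rows where the mask is True.
selected = df[mask]
print(selected)
```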
Here's our starting df :
In [42]: df
Out[42]:
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear

Now, if we need to match just one string, it's straightforward with elementwise equality :
In [42]: df == 'banana'
Out[42]:
       A      B      C
1  False   True  False
2  False  False  False
3   True  False  False
4  False  False  False

If we need to look for ANY one match in each row, use the .any method :
In [43]: (df == 'banana').any(axis=1)
Out[43]:
1     True
2    False
3     True
4    False
dtype: bool

To select corresponding rows :
In [44]: df[(df == 'banana').any(axis=1)]
Out[44]:
        A       B     C
1   apple  banana  pear
3  banana    pear  pear

1. Search for ANY match
Here's our starting df :
In [42]: df
Out[42]:
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear

NumPy's np.isin would work here (or use pandas' DataFrame.isin, as listed in other posts) to get all matches from the list of search strings in df. So, say we are looking for 'pear' or 'apple' in df :
In [51]: np.isin(df, ['pear','apple'])
Out[51]:
array([[ True, False,  True],
       [ True,  True,  True],
       [False,  True,  True],
       [ True,  True,  True]])

# ANY match along each row
In [52]: np.isin(df, ['pear','apple']).any(axis=1)
Out[52]: array([ True,  True,  True,  True])

# Select corresponding rows with masking
In [56]: df[np.isin(df, ['pear','apple']).any(axis=1)]
Out[56]:
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear

2. Search for ALL match
Here's our starting df again :
In [42]: df
Out[42]:
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear

So, now we are looking for rows that contain BOTH of, say, ['pear','apple']. We will make use of NumPy broadcasting :
In [66]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1)
Out[66]:
array([[ True,  True],
       [ True,  True],
       [ True, False],
       [ True,  True]])

So, we have a search list of 2 items and hence we have a 2D mask with number of rows = len(df) and number of cols = number of search items. Thus, in the above result, we have the first col for 'pear' and second one for 'apple'.
To make things concrete, let's get a mask for three items ['apple','banana', 'pear'] :
In [62]: np.equal.outer(df.to_numpy(copy=False), ['apple','banana', 'pear']).any(axis=1)
Out[62]:
array([[ True,  True,  True],
       [ True, False,  True],
       [False,  True,  True],
       [ True, False,  True]])

The columns of this mask are for 'apple','banana', 'pear' respectively.
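To see where that 2D mask comes from, here is a rough sketch of the intermediate shapes. It is written with explicit broadcasting rather than np.equal.outer, and it passes dtype=object to to_numpy so that the comparison runs on plain Python strings. Both choices are assumptions made for this illustration, as are the names values, search, cube and per_item.

```python
import numpy as np
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

values = df.to_numpy(dtype=object)                         # shape (4, 3)
search = np.array(["apple", "banana", "pear"], dtype=object)  # shape (3,)

# Adding a trailing axis lets every cell be compared against every search
# item: one boolean per (row, column, search item).
cube = values[:, :, None] == search
print(cube.shape)

# Collapsing the column axis answers "does this row contain item k anywhere?"
per_item = cube.any(axis=1)
print(per_item)
```

The second axis of per_item follows the order of the search list, which is why the columns line up with 'apple', 'banana' and 'pear'.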
Back to 2 search items case, we had earlier :
In [66]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1)
Out[66]:
array([[ True,  True],
       [ True,  True],
       [ True, False],
       [ True,  True]])

Since we are looking for ALL matches in each row :
In [67]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1).all(axis=1)
Out[67]: array([ True,  True, False,  True])

Finally, select rows :
In [70]: df[np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1).all(axis=1)]
Out[70]:
       A       B      C
1  apple  banana   pear
2   pear    pear  apple
4  apple   apple   pear
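For comparison, the ANY and ALL searches can also be sketched with pandas operations only. This is an alternative formulation, not the NumPy approach shown above, and the names search, any_mask and all_mask are illustrative.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

search = ["pear", "apple"]

# ANY: a row qualifies if at least one of its cells is in the search list.
any_mask = df.isin(search).any(axis=1)

# ALL: build one per-row mask for each search item, place them side by side,
# and require every one of them to be True.
all_mask = pd.concat(
    [(df == item).any(axis=1) for item in search], axis=1
).all(axis=1)

print(df[any_mask])
print(df[all_mask])
```

The ALL variant runs one pass over the frame per search item, so for a long search list the broadcasting version may be the better fit.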