
Select rows containing certain values from pandas dataframe


I have a pandas dataframe whose entries are all strings:

       A       B       C
    1  apple   banana  pear
    2  pear    pear    apple
    3  banana  pear    pear
    4  apple   apple   pear

etc. I want to select all the rows that contain a certain string, say, 'banana'. I don't know which column it will appear in each time. Of course, I can write a for loop and iterate over all rows. But is there an easier or faster way to do this?

ylangylang asked Jul 04 '16 13:07





1 Answer

Introduction

At the heart of selecting rows, we need a 1D boolean mask (a pandas Series of booleans) with the same length as df; let's call it mask. Then df[mask] gives us the selected rows of df through boolean indexing.
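As a minimal sketch of that idea, assuming the example frame from the question is rebuilt by hand and using an arbitrary hand-written mask purely for illustration:

```python
import pandas as pd

# Rebuild the example frame from the question (index 1..4, columns A, B, C).
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

# A hand-written mask: one boolean per row, aligned on the index of df.
mask = pd.Series([True, False, True, False], index=df.index)

# Boolean indexing keeps only the rows where the mask is True.
selected = df[mask]
print(selected)
```

The rest of the answer is about building such a mask from the cell values instead of writing it by hand.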

Here's our starting df :

    In [42]: df
    Out[42]:
            A       B      C
    1   apple  banana   pear
    2    pear    pear  apple
    3  banana    pear   pear
    4   apple   apple   pear

I. Match one string

Now, if we need to match just one string, it's straightforward with elementwise equality :

    In [42]: df == 'banana'
    Out[42]:
           A      B      C
    1  False   True  False
    2  False  False  False
    3   True  False  False
    4  False  False  False

If we need to look for ANY match in each row, use the .any method :

    In [43]: (df == 'banana').any(axis=1)
    Out[43]:
    1     True
    2    False
    3     True
    4    False
    dtype: bool

To select corresponding rows :

    In [44]: df[(df == 'banana').any(axis=1)]
    Out[44]:
            A       B     C
    1   apple  banana  pear
    3  banana    pear  pear
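The single-string case can be wrapped up as a small self-contained script. This is only a sketch: the helper name rows_containing is hypothetical (not part of pandas), and the frame is rebuilt by hand from the question.

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)


def rows_containing(frame, value):
    """Return the rows of frame in which at least one cell equals value."""
    # Elementwise comparison gives a boolean frame; any(axis=1) reduces it
    # to one boolean per row.
    mask = (frame == value).any(axis=1)
    return frame[mask]


banana_rows = rows_containing(df, "banana")
print(banana_rows)
```

Note that this compares whole cell values, so a cell such as 'banana split' would not count as a match for 'banana'.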

II. Match multiple strings

1. Search for ANY match

Here's our starting df :

    In [42]: df
    Out[42]:
            A       B      C
    1   apple  banana   pear
    2    pear    pear  apple
    3  banana    pear   pear
    4   apple   apple   pear

NumPy's np.isin works here (or use pandas' DataFrame.isin, as listed in other posts) to flag every cell of df that matches any string in the search list. So, say we are looking for 'pear' or 'apple' in df :

    In [51]: np.isin(df, ['pear','apple'])
    Out[51]:
    array([[ True, False,  True],
           [ True,  True,  True],
           [False,  True,  True],
           [ True,  True,  True]])

    # ANY match along each row
    In [52]: np.isin(df, ['pear','apple']).any(axis=1)
    Out[52]: array([ True,  True,  True,  True])

    # Select corresponding rows with masking
    In [56]: df[np.isin(df, ['pear','apple']).any(axis=1)]
    Out[56]:
            A       B      C
    1   apple  banana   pear
    2    pear    pear  apple
    3  banana    pear   pear
    4   apple   apple   pear
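For comparison, here is a sketch of the same ANY-match selection written with DataFrame.isin instead of np.isin. The variable names (wanted, cell_matches, row_mask) are illustrative, and the frame is rebuilt by hand from the question.

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

wanted = ["pear", "apple"]

# DataFrame.isin returns a boolean frame with the same shape and labels as df.
cell_matches = df.isin(wanted)

# Reduce along the columns: True if any cell in the row is in the search list.
row_mask = cell_matches.any(axis=1)

any_rows = df[row_mask]
print(any_rows)
```

Because the result of DataFrame.isin keeps the index of df, row_mask is a Series aligned with df rather than a bare NumPy array.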

2. Search for ALL match

Here's our starting df again :

    In [42]: df
    Out[42]:
            A       B      C
    1   apple  banana   pear
    2    pear    pear  apple
    3  banana    pear   pear
    4   apple   apple   pear

So, now we are looking for rows that contain BOTH 'pear' and 'apple', that is, every string in the search list ['pear','apple']. We will make use of NumPy broadcasting :

    In [66]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1)
    Out[66]:
    array([[ True,  True],
           [ True,  True],
           [ True, False],
           [ True,  True]])

So, we have a search list of 2 items and hence we have a 2D mask with number of rows = len(df) and number of cols = number of search items. Thus, in the above result, we have the first col for 'pear' and second one for 'apple'.

To make things concrete, let's get a mask for three items ['apple','banana', 'pear'] :

    In [62]: np.equal.outer(df.to_numpy(copy=False), ['apple','banana', 'pear']).any(axis=1)
    Out[62]:
    array([[ True,  True,  True],
           [ True, False,  True],
           [False,  True,  True],
           [ True, False,  True]])

The columns of this mask are for 'apple','banana', 'pear' respectively.
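To see where the shape of that mask comes from, the sketch below spells out the broadcasting step by step. It is an assumption-laden illustration: it uses an explicit trailing axis with == on object arrays instead of np.equal.outer, and the names cells, per_cell and per_row are made up for this example.

```python
import numpy as np
import pandas as pd

# Example frame from the question.
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

items = np.array(["apple", "banana", "pear"], dtype=object)

# Cell values as a 2D object array: shape (rows, columns) = (4, 3).
cells = df.to_numpy(dtype=object)

# Add a trailing axis so every cell is compared with every search item.
# Resulting shape: (rows, columns, items) = (4, 3, 3).
per_cell = cells[:, :, None] == items

# Collapse the column axis: does the row contain the item in any column?
# Resulting shape: (rows, items) = (4, 3).
per_row = per_cell.any(axis=1)

print(cells.shape, per_cell.shape, per_row.shape)
print(per_row)
```

Each column of per_row answers "does this row contain that item anywhere?", which is why the column order follows the order of the search list.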

Back to the case of 2 search items, we had earlier :

    In [66]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1)
    Out[66]:
    array([[ True,  True],
           [ True,  True],
           [ True, False],
           [ True,  True]])

Since we are looking for ALL matches in each row :

    In [67]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1).all(axis=1)
    Out[67]: array([ True,  True, False,  True])

Finally, select rows :

    In [70]: df[np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1).all(axis=1)]
    Out[70]:
           A       B      C
    1  apple  banana   pear
    2   pear    pear  apple
    4  apple   apple   pear
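If the broadcasting one-liner feels dense, the same ALL-match selection can be sketched with plain pandas by combining one row mask per search string. This is an alternative spelling, not the method used above, and the helper name rows_containing_all is hypothetical.

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)


def rows_containing_all(frame, values):
    """Return the rows of frame that contain every string in values."""
    # Start with every row selected, then require each value in turn.
    mask = pd.Series(True, index=frame.index)
    for value in values:
        mask = mask & (frame == value).any(axis=1)
    return frame[mask]


both = rows_containing_all(df, ["pear", "apple"])
print(both)
```

The loop trades the single vectorized comparison for one pass over df per search string, which is usually fine for a short search list. With an empty list, every row is returned.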
Divakar answered Sep 24 '22 05:09
