Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to select rows in Pandas dataframe based on string matching in multiple columns

I don't think this exact question has been answered yet, so here goes.

I have a Pandas data frame, and I want to select all rows that contain a string in column A or column B.

Say the dataframe looks like this:

d = {'id':["1", "2", "3", "4"], 
     'title': ["Horses are good", "Cats are bad", "Frogs are nice", "Turkeys are the best"], 
     'description':["Horse epitome", "Cats bad but horses good", "Frog fancier", "Turkey tome, not about horses"],
     'tags':["horse, cat, frog, turkey", "horse, cat, frog, turkey", "horse, cat, frog, turkey", "horse, cat, frog, turkey"],
     'date':["2019-01-01", "2019-10-01", "2018-08-14", "2016-11-29"]}

dataframe  = pandas.DataFrame(d)

Which gives:

id              title                      description               tag           date
1   "Horses are good"                  "Horse epitome"       "horse, cat"    2019-01-01
2      "Cats are bad"                       "Cats bad"       "horse, cat"    2019-10-01
3    "Frogs are nice"      "Frog fancier, horses good"      "horse, frog"    2018-08-14
4   "Turkey are best"                    "Turkey tome"    "turkey, horse"    2016-11-29

Let's say I want to create a new dataframe containing rows with the string horse (ignoring capitalisation) in the column title OR the column description, but not in the column tag (or any other column).

The result should be (row 2 and 4 get dropped):

id                title                     description                 tag          date  
1     "Horses are good"                  "Horse epitome"       "horse, cat"    2019-01-01
3      "Frogs are nice"      "Frog fancier, horses good"      "horse, frog"    2018-08-14

I have seen a few answers for one column, such as something like:

dataframe[dataframe['title'].str.contains('horse')]

But I am not sure (1) how to add multiple columns to this statement and (2) how to modify it with something like string.lower() to remove capitals in the column values for the string match.

Thanks in advance!

like image 848
arranjdavis Avatar asked Oct 25 '19 13:10

arranjdavis


People also ask

How to select rows in pandas Dataframe based on conditions?

Selecting rows in pandas DataFrame based on conditions Selecting rows based on particular column value using '>', '=', '=', '<=', '!=' operator. Selecting those rows whose column value is present in the list using isin() method of the dataframe. Selecting rows based on multiple column conditions using '&' operator.

How many rows are there in a Dataframe in Python?

Table 1 illustrates the output of the Python console and shows that our exemplifying data is made of six rows and three columns. This example shows how to get rows of a pandas DataFrame that have a certain value in a column of this DataFrame. In this specific example, we are selecting all rows where the column x3 is equal to the value 1.

How to select rows based on multiple column conditions in Excel?

Selecting rows based on multiple column conditions using '&' operator. Code #1 : Selecting all the rows from the given dataframe in which ‘Age’ is equal to 21 and ‘Stream’ is present in the options list using basic method.

How to create a Boolean filter in a pandas Dataframe?

You can write a function to be applied to each value in the States/cities column. Have the function return either True or False, and the result of applying the function can act as a Boolean filter on your DataFrame. This is a common pattern when working with pandas.


1 Answers

If want specify columns for test one possible solution is join all columns and then test with Series.str.contains and case=False:

s = dataframe['title'] + dataframe['description']
df = dataframe[s.str.contains('horse', case=False)]

Or create conditions for each column and chain them by bitwise OR with |:

df = dataframe[dataframe['title'].str.contains('horse', case=False) | 
               dataframe['description'].str.contains('horse', case=False)]

Also if want specify column column for not test chain solution with bitwise AND with invert condition by ~ for NOT MATCH:

df = dataframe[s.str.contains('horse', case=False) &
               ~dataframe['tags'].str.contains('horse', case=False)]

For second solution add () around all columns with chained by OR:

df = dataframe[(dataframe['title'].str.contains('horse', case=False) | 
               dataframe['description'].str.contains('horse', case=False)) &
              ~dataframe['tags'].str.contains('horse', case=False)]]

EDIT:

Like @WeNYoBen commented you can add DataFrame.copy to end for prevent SettingWithCopyWarning like:

s = dataframe['title'] + dataframe['description']
df = dataframe[s.str.contains('horse', case=False)].copy()
like image 130
jezrael Avatar answered Nov 02 '22 19:11

jezrael