Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to filter pandas dataframe by string?

I have a pandas dataframe that I'd like to filter by a specific word (test) in a column. I tried:

df[df[col].str.contains('test')]

But it returns an empty dataframe with just the column names. For the output, I'm looking for a dataframe that'd contain all rows that contain the word 'test'. What can I do?

EDIT (to add samples):

data = pd.read_csv(/...csv)

data has 5 cols, including 'BusinessDescription', and I want to extract all rows that have the word 'dental' (case insensitive) in the Business Description col, so I used:

filtered = data[data['BusinessDescription'].str.contains('dental')==True]

and I get an empty dataframe, with just the header names of the 5 cols.

like image 351
eh2699 Avatar asked Dec 29 '17 09:12

eh2699


People also ask

How do I get strings in pandas?

Pandas str. find() method is used to search a substring in each string present in a series. If the string is found, it returns the lowest index of its occurrence. If string is not found, it will return -1.


Video Answer


2 Answers

It seems you need parameter flags in contains:

import re

filtered = data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]

Another solution, thanks Anton vBR is convert to lowercase first:

filtered = data[data['BusinessDescription'].str.lower().str.contains('dental')]

Example:
For future programming I'd recommend using the keyword df instead of data when refering to dataframes. It is the common way around SO to use that notation.

import pandas as pd

data = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
df = pd.DataFrame(data)
df[df['BusinessDescription'].str.lower().str.contains('dental')]

  BusinessDescription
0        dental fluss
1              DENTAL

Timings:

d = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
data = pd.DataFrame(d)
data = pd.concat([data]*10000).reset_index(drop=True)

#print (data)

In [122]: %timeit data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]
10 loops, best of 3: 28.9 ms per loop

In [123]: %timeit data[data['BusinessDescription'].str.lower().str.contains('dental')]
10 loops, best of 3: 32.6 ms per loop

Caveat:

Performance really depend on the data - size of DataFrame and number of values matching condition.

like image 134
jezrael Avatar answered Sep 21 '22 12:09

jezrael


Keep the string enclosed in quotes.

df[df['col'].str.contains('test')]

Thanks

like image 26
Nephilim Avatar answered Sep 22 '22 12:09

Nephilim