I have a pandas dataframe that I'd like to filter by a specific word (test) in a column. I tried:
df[df[col].str.contains('test')]
But it returns an empty dataframe with just the column names. For the output, I'm looking for a dataframe that'd contain all rows that contain the word 'test'. What can I do?
EDIT (to add samples):
data = pd.read_csv(/...csv)
data has 5 cols, including 'BusinessDescription', and I want to extract all rows that have the word 'dental' (case insensitive) in the Business Description col, so I used:
filtered = data[data['BusinessDescription'].str.contains('dental')==True]
and I get an empty dataframe, with just the header names of the 5 cols.
The pandas str.find() method searches for a substring in each string of a Series. If the substring is found, it returns the lowest index of its occurrence; if it is not found, it returns -1.
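To make the return values concrete, here is a minimal sketch of str.find() on a small Series. The sample strings are made up for illustration, and the lowercase step is just one way to get a case-insensitive search, since str.find() itself is case-sensitive.

```python
import pandas as pd

# Hypothetical sample data, not taken from the asker's CSV.
s = pd.Series(['dental floss', 'DENTAL', 'Dentist'])

# Case-sensitive search: index of the first match, or -1 when absent.
positions = s.str.find('dental')

# Lowercase first so the search ignores case.
positions_ci = s.str.lower().str.find('dental')

# Turn the positions into a boolean mask: anything >= 0 is a hit.
matches = s[positions_ci >= 0]
print(matches)
```

For plain filtering, str.contains() is usually more direct because it returns the boolean mask without the extra comparison.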
It seems you need the flags parameter in contains:
import re
filtered = data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]
Another solution, thanks to Anton vBR, is to convert to lowercase first:
filtered = data[data['BusinessDescription'].str.lower().str.contains('dental')]
For future code, I'd recommend naming the variable df instead of data when referring to DataFrames. That is the common convention on SO.
Example:
import pandas as pd
data = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
df = pd.DataFrame(data)
df[df['BusinessDescription'].str.lower().str.contains('dental')]
BusinessDescription
0 dental fluss
1 DENTAL
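A possible variation, sketched here on the assumption that the real column may also contain missing values: str.contains() accepts case=False for a case-insensitive match and na=False so that missing entries count as non-matches instead of producing NaN in the mask. The None row below is added purely for illustration.

```python
import pandas as pd

# Same sample values as the example, plus a hypothetical missing entry.
df = pd.DataFrame(
    {'BusinessDescription': ['dental fluss', 'DENTAL', 'Dentist', None]}
)

# case=False ignores case; na=False treats missing values as "no match".
mask = df['BusinessDescription'].str.contains('dental', case=False, na=False)
filtered = df[mask]
print(filtered)
```

This avoids importing re, and the mask stays purely boolean even when the column has gaps.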
Timings:
d = dict(BusinessDescription=['dental fluss','DENTAL','Dentist'])
data = pd.DataFrame(d)
data = pd.concat([data]*10000).reset_index(drop=True)
#print (data)
In [122]: %timeit data[data['BusinessDescription'].str.contains('dental', flags = re.IGNORECASE)]
10 loops, best of 3: 28.9 ms per loop
In [123]: %timeit data[data['BusinessDescription'].str.lower().str.contains('dental')]
10 loops, best of 3: 32.6 ms per loop
Caveat:
Performance really depends on the data: the size of the DataFrame and the number of values matching the condition.
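Since the numbers depend on the data, it may be worth repeating the comparison on your own DataFrame. Below is a rough sketch using the standard timeit module, so it runs outside IPython as well. The helper names with_flags and with_lower are made up for this example, and the sample data only mirrors the small test set above.

```python
import re
import timeit

import pandas as pd

# Rebuild the sample data: 3 rows repeated 10000 times.
base = pd.DataFrame({'BusinessDescription': ['dental fluss', 'DENTAL', 'Dentist']})
data = pd.concat([base] * 10000).reset_index(drop=True)

def with_flags():
    # Case-insensitive match through the regex flag.
    return data[data['BusinessDescription'].str.contains('dental', flags=re.IGNORECASE)]

def with_lower():
    # Lowercase the column first, then do a case-sensitive match.
    return data[data['BusinessDescription'].str.lower().str.contains('dental')]

# Total time for 10 runs of each approach; results vary by machine and data.
t_flags = timeit.timeit(with_flags, number=10)
t_lower = timeit.timeit(with_lower, number=10)
print(f'flags: {t_flags:.3f}s  lower: {t_lower:.3f}s (10 runs each)')
```

Swap in your own DataFrame for data to see which variant is faster for your case.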
Keep the string enclosed in quotes.
df[df['col'].str.contains('test')]
Thanks