If I have a frame like this:
frame = pd.DataFrame({
"a": ["the cat is blue", "the sky is green", "the dog is black"]
})
and I want to check whether each row contains any of several words, I just have to do this:
frame["b"] = (
frame.a.str.contains("dog") |
frame.a.str.contains("cat") |
frame.a.str.contains("fish")
)
frame["b"]
outputs:
0 True
1 False
2 True
Name: b, dtype: bool
If I decide to make a list:
mylist = ["dog", "cat", "fish"]
How would I check whether each row contains any of the words in the list?
The pandas.Series.isin() function checks whether each element of a column equals one of several values. It returns a boolean Series showing whether each element exactly matches an element in the passed sequence of values, so it does not find substrings.
The str.contains() function tests whether a pattern or regex is contained within each string of a Series or Index. It returns a boolean Series or Index accordingly.
To test whether a string contains one of the substrings in a list in pandas, we can use the str.contains method with a regex pattern that matches any of them.
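The difference between exact matching and substring matching can be seen in a small sketch. The fourth row, "dog", is a hypothetical addition that is not in the question's frame; it is there only so that isin() has something to match.

```python
import pandas as pd

# The last row ("dog") is a hypothetical extra row added for illustration.
frame = pd.DataFrame({
    "a": ["the cat is blue", "the sky is green", "the dog is black", "dog"]
})
mylist = ["dog", "cat", "fish"]

# isin() is True only when the whole cell equals one of the list values.
exact = frame["a"].isin(mylist)

# str.contains() is True when any of the words appears anywhere in the cell.
substring = frame["a"].str.contains("|".join(mylist))

print(exact.tolist())      # [False, False, False, True]
print(substring.tolist())  # [True, False, True, True]
```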
frame = pd.DataFrame({'a' : ['the cat is blue', 'the sky is green', 'the dog is black']})
frame
a
0 the cat is blue
1 the sky is green
2 the dog is black
The str.contains method accepts a regular expression pattern:
mylist = ['dog', 'cat', 'fish']
pattern = '|'.join(mylist)
pattern
'dog|cat|fish'
frame.a.str.contains(pattern)
0 True
1 False
2 True
Name: a, dtype: bool
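Joining the list with '|' treats every entry as a regex, so entries containing characters such as '$' or '(' would be misread. A minimal sketch of one way to guard against that, assuming the words should be matched literally, is to pass each one through re.escape first. The sample rows and the whole-word variant with \b are illustrative additions, not part of the original answer.

```python
import re
import pandas as pd

# Hypothetical sample data chosen to show regex-special characters.
frame = pd.DataFrame({
    "a": ["the cat is blue", "concatenate the strings", "cost is $5 (approx.)"]
})
mylist = ["cat", "$5", "(approx.)"]

# Escape each word so it is matched literally, then join into one alternation.
escaped = "|".join(re.escape(word) for word in mylist)
contains_any = frame["a"].str.contains(escaped)

# Optional: match whole words only, so "cat" does not match "concatenate".
words = ["cat", "dog"]
whole_word = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
contains_word = frame["a"].str.contains(whole_word)

print(contains_any.tolist())   # [True, True, True]
print(contains_word.tolist())  # [True, False, False]
```

The non-capturing group (?:...) is used so that pandas does not warn about match groups in the pattern.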
Because regex patterns are supported, you can also embed flags:
frame = pd.DataFrame({'a' : ['Cat Mr. Nibbles is blue', 'the sky is green', 'the dog is black']})
frame
a
0 Cat Mr. Nibbles is blue
1 the sky is green
2 the dog is black
# A global flag such as (?i) must come at the very start of the pattern;
# Python 3.11+ raises an error if it appears anywhere else.
pattern = '(?i)' + '|'.join(mylist)
pattern
'(?i)dog|cat|fish'
frame.a.str.contains(pattern)
0 True # Because of the (?i) flag, 'Cat' is also matched to 'cat'
1 False
2 True
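If case-insensitive matching is all that is needed, an alternative to embedding a flag in the pattern is the case parameter of str.contains. The sketch below reuses the frame and list from the example above and compares the two settings.

```python
import pandas as pd

frame = pd.DataFrame({
    "a": ["Cat Mr. Nibbles is blue", "the sky is green", "the dog is black"]
})
mylist = ["dog", "cat", "fish"]
pattern = "|".join(mylist)

# case=False makes the whole pattern case-insensitive without any inline flag.
case_insensitive = frame["a"].str.contains(pattern, case=False)

# The default (case=True) does not match 'Cat' against 'cat'.
case_sensitive = frame["a"].str.contains(pattern)

print(case_insensitive.tolist())  # [True, False, True]
print(case_sensitive.tolist())    # [False, False, True]
```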
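For an exact match against the list, this should work:
print(frame[frame["a"].isin(mylist)])
See pandas.Series.isin(). Note that isin() compares whole values, so it will not find a word inside a longer string; for the frame above it returns no rows.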
Following the comments on the accepted answer about extracting the matched string, this approach can also be tried.
frame = pd.DataFrame({'a' : ['the cat is blue', 'the sky is green', 'the dog is black']})
frame
a
0 the cat is blue
1 the sky is green
2 the dog is black
Let us create our list of strings that need to be matched and extracted.
mylist = ['dog', 'cat', 'fish']
pattern = '|'.join(mylist)
Now let us create a function that finds and extracts the substring.
import re
def pattern_searcher(search_str: str, search_list: str) -> str:
    # search_list is a regex pattern such as 'dog|cat|fish'
    search_obj = re.search(search_list, search_str)
    if search_obj:
        # Slice out the first matched substring
        return_str = search_str[search_obj.start():search_obj.end()]
    else:
        return_str = 'NA'
    return return_str
We will use this function with pandas.Series.apply:
frame['matched_str'] = frame['a'].apply(lambda x: pattern_searcher(search_str=x, search_list=pattern))
Result:
a matched_str
0 the cat is blue cat
1 the sky is green NA
2 the dog is black dog
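The same result can probably be reached without a Python-level function by using str.extract, which returns the first match of a capture group for each row. This is a sketch of an alternative, not the approach from the answer above; wrapping the words in re.escape and filling missing values with 'NA' are choices made here to mirror that answer's output.

```python
import re
import pandas as pd

frame = pd.DataFrame({
    "a": ["the cat is blue", "the sky is green", "the dog is black"]
})
mylist = ["dog", "cat", "fish"]

# str.extract needs a capture group; escape each word so it is matched literally.
pattern = "(" + "|".join(re.escape(word) for word in mylist) + ")"

# expand=False returns a Series; rows with no match are NaN, replaced here by 'NA'.
frame["matched_str"] = frame["a"].str.extract(pattern, expand=False).fillna("NA")

print(frame["matched_str"].tolist())  # ['cat', 'NA', 'dog']
```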