When I try to use regular expression for finding strings in other strings, it does not work as expected. Here is an example:
import re
message = 'I really like beer, but my favourite beer is German beer.'
keywords = ['beer', 'german beer', 'german']
regex = re.compile("|".join(keywords))
regex.findall(message.lower())
Result:
['beer', 'beer', 'german beer']
But the expected result would be:
['beer', 'beer', 'german beer', 'german']
Another way to do that could be:
results = []
for k in keywords:
regex = re.compile(k)
for r in regex.findall(message.lower()):
results.append(r)
['beer', 'beer', 'beer', 'german beer', 'german']
It works like I want, but I think it is not the best way to do that. Can somebody help me?
A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. RegEx can be used to check if a string contains the specified search pattern.
Python has a module named re to work with RegEx. Here's an example: import re pattern = '^a...s$' test_string = 'abyss' result = re. match(pattern, test_string) if result: print("Search successful.") else: print("Search unsuccessful.")
re.findall
cannot find overlapping matches. If you want to use regular expressions you will have to create separate expressions and run them in a loop as in your second example.
Note that your second example can also be shortened to the following, though it's a matter of taste whether you find this more readable:
results = [r for k in keywords for r in re.findall(k, message.lower())]
Your specific example doesn't require the use of regular expressions. You should avoid using regular expressions if you just want to find fixed strings.
re.findall
is described in http://docs.python.org/2/library/re.html
"Return all non-overlapping matches of pattern in string..."
Non-overlapping means that for "german beer" it will not find "german beer" AND "german", because those matches are overlapping.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With