
Search strings using regular expression in Python

When I use a regular expression to find several keywords inside a string, it does not work as I expect. Here is an example:

import re
message = 'I really like beer, but my favourite beer is German beer.'
keywords = ['beer', 'german beer', 'german']

regex = re.compile("|".join(keywords))
regex.findall(message.lower())

Result:

['beer', 'beer', 'german beer']

But the expected result would be:

['beer', 'beer', 'german beer', 'german']

Another way to do it would be:

results = []
for k in keywords:
    regex = re.compile(k)
    for r in regex.findall(message.lower()):
        results.append(r)

Result:

['beer', 'beer', 'beer', 'german beer', 'german']

This works the way I want, but I doubt it is the best way to do it. Can somebody help me?

Adrian asked Dec 25 '12 18:12



2 Answers

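re.findall cannot find overlapping matches: once "german beer" has matched, scanning resumes after the end of that match, so "german" at the same position is never reported. If you want to use regular expressions, you will have to create separate expressions and run them in a loop, as in your second example.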

Note that your second example can also be shortened to the following, though it's a matter of taste whether you find this more readable:

results = [r for k in keywords for r in re.findall(k, message.lower())] 
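
If the keywords might ever contain characters that are special in a pattern (such as "." or "+"), a variant along these lines may be safer. This is only a sketch: the `text` variable is introduced here for illustration, and `re.escape` is used on the assumption that the keywords are meant as literal strings.

```python
import re

message = 'I really like beer, but my favourite beer is German beer.'
keywords = ['beer', 'german beer', 'german']

text = message.lower()  # lowercase once, outside the comprehension

# re.escape makes each keyword match literally, even if it contains
# characters that would otherwise be interpreted as regex syntax.
results = [r for k in keywords for r in re.findall(re.escape(k), text)]
print(results)
```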

Your specific example doesn't require regular expressions at all. If you just want to find fixed strings, plain string methods are simpler and you should prefer them.
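
As a rough illustration of the fixed-string approach, the sketch below uses `str.find` in a loop. The helper name `find_all` is hypothetical (it is not part of the standard library), and it assumes that every occurrence of every keyword should be reported, as in the loop from the question.

```python
def find_all(haystack, needle):
    """Yield the start index of every occurrence of needle in haystack,
    including occurrences that overlap each other."""
    start = haystack.find(needle)
    while start != -1:
        yield start
        # Resume one character later so overlapping hits are not skipped.
        start = haystack.find(needle, start + 1)

message = 'I really like beer, but my favourite beer is German beer.'
keywords = ['beer', 'german beer', 'german']

text = message.lower()
# One entry per occurrence of each keyword, in keyword order.
results = [k for k in keywords for _ in find_all(text, k)]
print(results)
```

Because each keyword is searched independently, a hit for "german beer" does not prevent "german" or "beer" from being found at the same place.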

Mark Byers answered Oct 25 '22 05:10


re.findall is described in http://docs.python.org/2/library/re.html

"Return all non-overlapping matches of pattern in string..."

Non-overlapping means that for "german beer" it will not find both "german beer" and "german", because those two matches overlap.
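
To make the overlap visible, one can compare the spans that `re.finditer` reports. The following is a small sketch using the lowercased message from the question; the `overlaps` helper and the `combined` and `spans` names are made up for this illustration.

```python
import re

text = 'i really like beer, but my favourite beer is german beer.'
keywords = ['beer', 'german beer', 'german']

# Spans consumed by the single combined pattern from the question.
combined = [m.span() for m in re.finditer('|'.join(keywords), text)]

# Spans found when each keyword is searched on its own.
spans = {k: [m.span() for m in re.finditer(re.escape(k), text)]
         for k in keywords}

def overlaps(a, b):
    """True if the half-open spans a and b share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

print(combined)
print(spans)
print(overlaps(spans['german beer'][0], spans['german'][0]))
```

The span for "german" starts at the same index as the span for "german beer", so a single scan that has already consumed "german beer" has nothing left to report there.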

Omri Barel answered Oct 25 '22 06:10