Python: defining a union of regular expressions

Tags: python, regex

I have a list of patterns like

list_patterns = [': error:', ': warning:', 'cc1plus:', 'undefine reference to']

What I want is to produce a union of all of them: a single regular expression that matches every element in list_patterns [but presumably does not match any re not in list_patterns -- msw]

re.compile(list_patterns)

Is this possible?

daydreamer asked Jul 18 '10 01:07



2 Answers

There are a couple of ways of doing this. The simplest is:

import re

list_patterns = [': error:', ': warning:', 'cc1plus:', 'undefine reference to']
string = 'there is an : error: and a cc1plus: in this string'
print(re.findall('|'.join(list_patterns), string))

Output:

[': error:', 'cc1plus:']

This is fine as long as joining your search patterns doesn't break the regex (e.g. if one of them contains a regex special character such as an opening parenthesis). You can handle that case this way:

import re

list_patterns = [': error:', ': warning:', 'cc1plus:', 'undefine reference to']
string = 'there is an : error: and a cc1plus: in this string'
pattern = "|".join(re.escape(p) for p in list_patterns)
print(re.findall(pattern, string))

The output is the same, but each pattern is first passed through re.escape(), which escapes any regex special characters so that they match literally.
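To make the difference concrete, here is a minimal sketch. The names literal_patterns, unescaped_ok and matches, and the sample strings, are made up for illustration and are not from the question. It joins strings containing regex metacharacters with and without re.escape():

```python
import re

# Hypothetical literal strings; two of them contain regex metacharacters.
literal_patterns = ['foo(', 'a.b', 'cc1plus:']

# Joined as-is, the unbalanced "(" makes the combined pattern invalid.
try:
    re.compile('|'.join(literal_patterns))
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# Escaping each piece first makes every character match literally,
# so "a.b" no longer matches "aXb".
escaped = re.compile('|'.join(re.escape(p) for p in literal_patterns))
matches = escaped.findall('call foo( then a.b, not aXb')

print(unescaped_ok)  # False
print(matches)       # ['foo(', 'a.b']
```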

Which one you use depends on your list of patterns. If they are regular expressions that can be assumed to be valid, the first is probably appropriate. If they are plain strings, use the second method.

The first gets more complicated, however, because joining several regular expressions may change the grouping and have other unintended side effects.
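One way to limit such side effects, offered as a sketch rather than as part of the original answer, is to wrap each sub-pattern (and the union as a whole) in a non-capturing group (?:...). The patterns and sample text below are hypothetical:

```python
import re

# Hypothetical sub-patterns that are themselves regular expressions.
regex_patterns = [r'error|warning', r'cc1plus']

# Naive embedding: "|" has the lowest precedence, so the trailing ":"
# attaches only to the last alternative ("cc1plus:").
naive = re.compile('|'.join(regex_patterns) + ':')

# Wrapping each piece, and the whole union, in a non-capturing group
# keeps the alternation self-contained and adds no capture groups.
union = '(?:' + '|'.join('(?:%s)' % p for p in regex_patterns) + ')'
grouped = re.compile(union + ':')

text = 'error without colon, warning: here'
naive_matches = naive.findall(text)
grouped_matches = grouped.findall(text)

print(naive_matches)    # ['error', 'warning']
print(grouped_matches)  # ['warning:']
```

Non-capturing groups are used here so that the numbering of any capture groups inside the sub-patterns is not shifted by the wrapper.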

cletus answered Sep 26 '22 17:09


list_regexs = [re.compile(x) for x in list_patterns]
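Note that this gives a list of separately compiled patterns rather than one combined regex. A minimal sketch of how such a list might be checked against a line of text, assuming a made-up sample line and the hypothetical names line, matched and any_match:

```python
import re

list_patterns = [': error:', ': warning:', 'cc1plus:', 'undefine reference to']
list_regexs = [re.compile(x) for x in list_patterns]

# Hypothetical compiler output line, for illustration only.
line = 'foo.cpp:12: error: expected ";"'

# Collect the source text of every compiled pattern that occurs in the line.
matched = [r.pattern for r in list_regexs if r.search(line)]

# Or simply test whether any of them matches.
any_match = any(r.search(line) for r in list_regexs)

print(matched)    # [': error:']
print(any_match)  # True
```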
Ignacio Vazquez-Abrams answered Sep 25 '22 17:09