Escaping [ in Python Regular Expressions

Tags: python, regex

This regular expression search correctly checks whether a string contains the word harry:

re.search(r'\bharry\b', '[harry] blah', re.IGNORECASE)

However, I need to ensure that the string contains [harry], brackets included. I have tried escaping the brackets with various numbers of backslashes:

re.search(r'\b\[harry\]\b', '[harry] blah', re.IGNORECASE)
re.search(r'\b\\[harry\\]\b', '[harry] blah', re.IGNORECASE)
re.search(r'\b\\\[harry\\\]\b', '[harry] blah', re.IGNORECASE)

None of these find the match. What do I need to do?

elgaz asked Aug 05 '10 10:08


3 Answers

The first one is correct:

r'\b\[harry\]\b'

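But this won’t match [harry] blah: [ is not a word character, so there is no word boundary before it, and there is none after ] either, because ] is followed by a space. The pattern would only match if there were a word character directly before [ and directly after ], as in foobar[harry]blah.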
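As a quick check, the sketch below runs that pattern against a few inputs. The sample strings other than '[harry] blah' are made up for illustration. Note that the trailing \b has the same requirement as the leading one: it needs a word character right after ].

```python
import re

pattern = re.compile(r'\b\[harry\]\b', re.IGNORECASE)

# No word character next to either bracket, so neither \b can hold.
no_boundary = pattern.search('[harry] blah')

# Word character before '[' but a space after ']': the trailing \b still fails.
left_only = pattern.search('foobar[harry] blah')

# Word characters on both sides: both \b assertions hold.
both_sides = pattern.search('foobar[harry]blah')

print(no_boundary, left_only, both_sides)
```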

Gumbo answered Sep 21 '22 01:09


>>> import re
>>> re.search(r'\bharry\b','[harry] blah',re.IGNORECASE)
<_sre.SRE_Match object at 0x7f14d22df648>
>>> re.search(r'\b\[harry\]\b','[harry] blah',re.IGNORECASE)
>>> re.search(r'\[harry\]','[harry] blah',re.IGNORECASE)
<_sre.SRE_Match object at 0x7f14d22df6b0>
>>> re.search(r'\[harry\]','harry blah',re.IGNORECASE)

The problem is the \b, not the brackets. A single backslash is correct for escaping.
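If some boundary-like check around the brackets is still wanted, one option that is not part of the answers here is to replace \b with negative lookarounds. This is only a sketch, and it assumes the requirement is "[harry] must not be glued to a word character on either side"; the sample strings are illustrative.

```python
import re

# Assumption: "[harry]" should not touch a word character on either side.
# (?<!\w) = not preceded by a word character, (?!\w) = not followed by one.
pattern = re.compile(r'(?<!\w)\[harry\](?!\w)', re.IGNORECASE)

standalone = pattern.search('[harry] blah')       # brackets at start of string
embedded = pattern.search('foobar[harry] blah')   # 'r' sits right before '['
upper = pattern.search('see [HARRY].')            # IGNORECASE still applies

print(standalone, embedded, upper)
```

Unlike \b, these lookarounds do not require a word character on one side of the position, so they also succeed at the start or end of the string.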

Mad Scientist answered Sep 22 '22 01:09


You escape it the way you escape most regex metacharacters: by preceding it with a backslash.

Thus, r"\[harry\]" will match a literal string [harry].
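When the literal text comes from a variable rather than being typed into the pattern, re.escape can add the backslashes for you. A minimal sketch, where the variable names are only illustrative:

```python
import re

literal = '[harry]'

# re.escape backslash-escapes the regex metacharacters in the string.
escaped = re.escape(literal)

match = re.search(escaped, '[harry] blah', re.IGNORECASE)

print(escaped)
print(match.group())
```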

The problem is with the \b in your pattern. This is the word boundary anchor.

The \b matches:

  • At the beginning of the string, if it starts with a word character
  • At the end of the string, if it ends with a word character
  • Between a word character \w and a non-word character \W (note the case difference)

The brackets [ and ] are NOT word characters, so if a string starts with [, there is no \b to its left. Anywhere there is no \b, there is \B instead (note the case difference).
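To see where the boundaries actually fall in the question's string, the positions matched by \b and \B can be listed. This is just an illustrative sketch; it relies on re.finditer returning the zero-width matches.

```python
import re

text = '[harry] blah'

# Every position in the string is either a \b or a \B, never both.
boundaries = [m.start() for m in re.finditer(r'\b', text)]
non_boundaries = [m.start() for m in re.finditer(r'\B', text)]

print(boundaries)      # [1, 6, 8, 12]
print(non_boundaries)  # [0, 2, 3, 4, 5, 7, 9, 10, 11]
```

Position 0 (before [) and position 7 (between ] and the space) are both \B, which is why \b\[harry\]\b cannot match here.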

References

  • regular-expressions.info/Word Boundaries
  • http://docs.python.org/library/re.html

    \b : Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that \b is defined as the boundary between \w and \W, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.
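The last sentence of that quote, about \b inside a character class, can be checked with a small sketch. The test string is made up for illustration.

```python
import re

# A normal (non-raw) string literal: 'a', backspace (\x08), 'c'.
text = 'a\bc'

# Inside [...], \b means the backspace character, not a word boundary.
in_class = re.search(r'[\b]', text)

# Outside a class, \b is a zero-width boundary; a lone backspace has none.
outside_class = re.search(r'\b', '\x08')

print(in_class.start(), repr(in_class.group()), outside_class)
```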

polygenelubricants answered Sep 19 '22 01:09