I want to use input from a user as a regex pattern for a search over some text. It works, but how I can handle cases where user puts characters that have meaning in regex?
For example, the user wants to search for Word (s)
: regex engine will take the (s)
as a group. I want it to treat it like a string "(s)"
. I can run replace
on user input and replace the (
with \(
and the )
with \)
but the problem is I will need to do replace for every possible regex symbol.
Do you know some better way ?
Now, escaping a string (in regex terms) means finding all of the characters with special meaning and putting a backslash in front of them, including in front of other backslash characters. When you've done this one time on the string, you have officially "escaped the string".
To insert characters that are illegal in a string, use an escape character. An escape character is a backslash \ followed by the character you want to insert.
We should double for a backslash escape a forward slash / in a regular expression. A backslash \ is used to denote character classes, e.g. \d . So it's a special character in regular expression (just like in regular strings).
Use the re.escape()
function for this:
4.2.3 re
Module Contents
escape(string)
Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
A simplistic example, search any occurence of the provided string optionally followed by 's', and return the match object.
def simplistic_plural(word, text):
word_or_plural = re.escape(word) + 's?'
return re.match(word_or_plural, text)
You can use re.escape()
:
re.escape(string) Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
>>> import re
>>> re.escape('^a.*$')
'\\^a\\.\\*\\$'
If you are using a Python version < 3.7, this will escape non-alphanumerics that are not part of regular expression syntax as well.
If you are using a Python version < 3.7 but >= 3.3, this will escape non-alphanumerics that are not part of regular expression syntax, except for specifically underscore (_
).
Unfortunately, re.escape()
is not suited for the replacement string:
>>> re.sub('a', re.escape('_'), 'aa')
'\\_\\_'
A solution is to put the replacement in a lambda:
>>> re.sub('a', lambda _: '_', 'aa')
'__'
because the return value of the lambda is treated by re.sub()
as a literal string.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With