I'm using Python 3. In my application, the user can type a regular expression string directly, and the application uses it to match against some strings. For example, the user can type \t+. However, I can't get it to work, because I can't convert the input into a correct regular expression. Below is the code I've tried.
>>> import re
>>> re.compile(re.escape("\t+")).findall(" ")
[]
However, when I change the regex string to \t, it works.
>>> re.compile(re.escape("\t")).findall(" ")
['\t']
Note that the argument to findall IS a tab character. I don't know why it isn't displayed correctly on Stack Overflow.
Can anyone point me in the right direction to solve this? Thanks.
I assume that the user input is a string, wherever in your system it comes from:
user_input = input("Input regex:") # check console, it is expecting your input
print("User typed: '{}'. Input type: {}.".format(user_input, type(user_input)))
This means that you need to turn it into a regex, and that is what re.compile is for. If you call re.compile with a str that is not a valid regex, it raises re.error.
Therefore, you can write a function that checks whether the input is valid. You used re.escape, so I added a flag to the function that controls whether re.escape is applied.
def is_valid_regex(regex_from_user: str, escape: bool) -> bool:
    try:
        if escape:
            re.compile(re.escape(regex_from_user))
        else:
            re.compile(regex_from_user)
        is_valid = True
    except re.error:
        is_valid = False
    return is_valid
print("If you don't use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=False)))
print("If you do use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=True)))
If your user input is \t+, you will get:
>> If you don't use re.escape, the input is valid: True.
>> If you do use re.escape, the input is valid: True.
However, if your user input is [\t+, you will get:
>> If you don't use re.escape, the input is valid: False.
>> If you do use re.escape, the input is valid: True.
Notice that [\t+ is indeed an invalid regex; however, with re.escape it becomes valid. That is because re.escape escapes all the special characters, so they are treated as literal characters. So with the input \t+, if you use re.escape you will be looking for the literal sequence of characters \, t, + and not for a tab character.
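To make the difference concrete, here is a minimal sketch. It assumes the user typed \t+, which input() returns as the three characters backslash, t, plus (written below as a raw string). The names typed and text are made up for this illustration.

```python
import re

# What input() returns when the user types \t+ : backslash, 't', '+'.
typed = r"\t+"

# Sample text (an assumption for this sketch): two real tabs,
# plus the literal characters backslash, t, plus.
text = "a\t\tb and a literal \\t+ here"

# Compiled directly: the regex engine reads \t as a tab and + as "one or more".
direct_matches = re.compile(typed).findall(text)

# Compiled after re.escape: every character is matched literally.
escaped_matches = re.compile(re.escape(typed)).findall(text)

print(direct_matches)   # ['\t\t']
print(escaped_matches)  # ['\\t+']
```

So if the user is meant to type real regular expressions, compiling the input directly (after a validity check) is probably what you want. re.escape is for the opposite case, where the input should be treated as plain text.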
Take the string you want to search. For example, here is a string where the character between the double quotes is a tab:
string_to_look_in = 'This is a string with a "\t" tab character.'
You can check for tabs manually by using the repr function.
print(string_to_look_in)
print(repr(string_to_look_in))
>> This is a string with a " " tab character.
>> 'This is a string with a "\t" tab character.'
Notice that with repr, the tab character is displayed as its \t representation.
Here is a script for you to try all these things:
import re

# The character between the double quotes is a real tab.
string_to_look_in = 'This is a string with a "\t" tab character.'
print("String to look into:", string_to_look_in)
print("String to look into:", repr(string_to_look_in), "\n")

user_input = input("Input regex:")  # check console, it is expecting your input
print("\nUser typed: '{}'. Input type: {}.".format(user_input, type(user_input)))

def is_valid_regex(regex_from_user: str, escape: bool) -> bool:
    try:
        if escape:
            re.compile(re.escape(regex_from_user))
        else:
            re.compile(regex_from_user)
        is_valid = True
    except re.error:
        is_valid = False
    return is_valid

print("\nIf you don't use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=False)))
print("If you do use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=True)))

if is_valid_regex(user_input, escape=False):
    regex = re.compile(user_input)
    print("\nRegex compiled as '{}' with type {}.".format(repr(regex), type(regex)))
    matches = regex.findall(string_to_look_in)
    print('Matches found:', matches)
else:
    print('\nThe regex was not valid, so no matches.')
The result of re.escape("\t+") is '\\\t\\+'. Note that the + sign is escaped with a backslash and is no longer a special character, so the pattern does not mean "one or more tabs"; it means a tab followed by a literal plus sign.
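A quick way to check this yourself is the sketch below. It uses the same Python string literal as the question, so "\t" here is a real tab character; the variable names are only for illustration.

```python
import re

pattern_literal = "\t+"               # a real tab followed by '+'
escaped = re.escape(pattern_literal)
print(repr(escaped))                  # '\\\t\\+'

# Escaped pattern: a tab followed by a literal plus sign.
only_tab = re.findall(escaped, "\t")
tab_plus = re.findall(escaped, "\t+")
print(only_tab)                       # []
print(tab_plus)                       # ['\t+']

# Unescaped pattern: '+' is a quantifier, so a run of tabs matches.
tab_run = re.findall(pattern_literal, "\t\t")
print(tab_run)                        # ['\t\t']
```

This is why the first snippet in the question returns an empty list: the searched string contains a tab but no plus sign after it.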