
Python user input as regular expression, how to do it correctly?

I'm using Python 3. In my application, the user can input a regular expression string directly, and the application uses it to match some strings. For example, the user can type \t+. However, I can't get it to work because I can't convert the input into a correct regular expression. Below is the code I've tried.

>>> import re
>>> re.compile(re.escape("\t+")).findall("  ")
[]

However, when I change the regex string to \t, it works.

>>> re.compile(re.escape("\t")).findall("   ")
['\t']

Note that the argument to findall is a tab character. It does not seem to display correctly on Stack Overflow.

Can anyone point me in the right direction to solve this? Thanks.

Just a learner asked Dec 24 '22 05:12


2 Answers

Compile user input

I assume that the user input is a string, wherever it comes from in your system:

user_input = input("Input regex:")  # check console, it is expecting your input
print("User typed: '{}'. Input type: {}.".format(user_input, type(user_input)))

This means that you need to transform it into a regex, and that is what re.compile is for. If the string you pass to re.compile is not a valid regex, it raises re.error.

Therefore, you can write a function that checks whether the input is valid. Since you used re.escape, I added a flag to the function that controls whether re.escape is applied.

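import re

def is_valid_regex(regex_from_user: str, escape: bool) -> bool:
    """Return True if the input (optionally escaped first) compiles as a regex."""
    pattern = re.escape(regex_from_user) if escape else regex_from_user
    try:
        re.compile(pattern)
    except re.error:
        # re.compile raises re.error for an invalid pattern
        return False
    return True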

print("If you don't use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=False)))
print("If you do use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=True)))

If your user input is: \t+, you will get:

>> If you don't use re.escape, the input is valid: True.
>> If you do use re.escape, the input is valid: True.

However, if your user input is: [\t+, you will get:

>> If you don't use re.escape, the input is valid: False.
>> If you do use re.escape, the input is valid: True.

Notice that [\t+ is indeed an invalid regex, yet it becomes valid once you apply re.escape. That is because re.escape escapes all special characters so that they are treated as literal characters. So if the user types \t+ and you apply re.escape, you will be searching for the literal character sequence \, t, + and not for one or more tab characters.
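To make the difference concrete, here is a minimal sketch. It assumes the user typed the three characters \, t, + (written as the raw string r"\t+" to mimic what input() would return). The sample texts and variable names are made up for illustration.

```python
import re

# What input() would return if the user typed \t+ : backslash, 't', '+'.
user_input = r"\t+"
text = "a\t\tb"  # hypothetical sample containing two consecutive tabs

# Compiled directly, the regex engine reads \t as "tab" and + as "one or more".
as_regex = re.compile(user_input).findall(text)

# Escaped first, every character is literal, so only the text \t+ itself matches.
escaped = re.compile(re.escape(user_input))
as_literal = escaped.findall(text)
literal_hit = escaped.findall(r"cost\t+tax")

print(as_regex)     # ['\t\t']
print(as_literal)   # []
print(literal_hit)  # ['\\t+']
```

In other words, re.escape is the right tool when the user is typing plain text to search for, and the wrong one when the user is typing a regular expression.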

Checking your lookup string

Take the string you want to look into. For example, here is a string where the character between quotes is supposed to be a tab:

string_to_look_in = 'This is a string with a "\t" tab character.'  # \t is the escape for a tab

You can manually check for tabs by using the repr function.

print(string_to_look_in)
print(repr(string_to_look_in))
>> This is a string with a "    " tab character.
>> 'This is a string with a "\t" tab character.'

Notice that by using repr the \t representation of the tab character gets displayed.

Test script

Here is a script for you to try all these things:

import re

string_to_look_in = 'This is a string with a "\t" tab character.'  # \t is the escape for a tab
print("String to look into:", string_to_look_in)
print("String to look into:", repr(string_to_look_in), "\n")

user_input = input("Input regex:")  # check console, it is expecting your input

print("\nUser typed: '{}'. Input type: {}.".format(user_input, type(user_input)))


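def is_valid_regex(regex_from_user: str, escape: bool) -> bool:
    """Return True if the input (optionally escaped first) compiles as a regex."""
    pattern = re.escape(regex_from_user) if escape else regex_from_user
    try:
        re.compile(pattern)
    except re.error:
        # re.compile raises re.error for an invalid pattern
        return False
    return True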

print("\nIf you don't use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=False)))
print("If you do use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=True)))

if is_valid_regex(user_input, escape=False):
    regex = re.compile(user_input)
    print("\nRegex compiled as '{}' with type {}.".format(repr(regex), type(regex)))

    matches = regex.findall(string_to_look_in)
    print('Matches found:', matches)

else:
    print('\nThe regex was not valid, so no matches.')
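
If you want to try the same logic without typing into the console, the validation and the search can be folded into one helper. This is only a sketch: find_matches is a hypothetical name, not something from the re module, and it assumes you want None back for an invalid pattern.

```python
import re
from typing import Optional


def find_matches(pattern_text: str, text: str) -> Optional[list]:
    """Return all matches of a user-supplied pattern, or None if it is invalid."""
    try:
        regex = re.compile(pattern_text)  # no re.escape: keep regex semantics
    except re.error:
        return None
    return regex.findall(text)


sample = 'This is a string with a "\t" tab character.'
print(find_matches(r"\t+", sample))   # ['\t']
print(find_matches(r"[\t+", sample))  # None
```

Returning None keeps "invalid pattern" distinct from "valid pattern, no matches", which would be an empty list.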
Bruno Lubascher answered Dec 25 '22 20:12


The result of re.escape("\t+") is '\\\t\\+'. Note that the + sign is escaped with a backslash and is not a special character anymore. It does not mean "one or more tabs."
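A short check of this, using the same Python string literal as the question (where "\t+" is a real tab followed by a plus sign). The variable names are only for illustration.

```python
import re

escaped = re.escape("\t+")  # the literal "\t+" is a tab followed by '+'
print(repr(escaped))         # '\\\t\\+'

tabs = "\t\t"
no_match = re.findall(escaped, tabs)      # needs a literal '+' after the tab
quantified = re.findall("\t+", tabs)      # without re.escape, '+' is a quantifier
literal = re.findall(escaped, "\t+")      # matches only a tab followed by '+'

print(no_match)    # []
print(quantified)  # ['\t\t']
print(literal)     # ['\t+']
```

So dropping re.escape is what restores the "one or more tabs" meaning.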

DYZ answered Dec 25 '22 18:12