Python regex match fails with UTF-8 characters

I have a Selenium/Python project that uses a regex match to find HTML elements. The element attributes sometimes include the Danish/Norwegian characters ÆØÅ. The problem is in the snippet below:

if (re.match(regexp_expression, compare_string)):
    result = True
else :
    result = False

Both regexp_expression and compare_string are manipulated before the regex match is executed. If I print them just before the snippet above runs, and also print the result, I get the following output:

Regex_expression: [^log på$]
compare string: [log på]
result = false

I added the brackets to make sure there was no stray whitespace. They are only part of the print statement, not part of the string variables.

However, if I try to reproduce the problem in a separate script, like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

regexp_expression  = "^log på$"
compare_string = "log på"

if (re.match(regexp_expression, compare_string)):
    print("result true")
    result = True
else :
    print("result = false")
    result = False

Then the result is true.

How can this be? To make it even stranger, it worked earlier, and I am not sure what I edited that made it go boom...

The full module containing the regex compare method is below. I did not write it myself, so I am not 100% familiar with the reasons for all the replace statements and string manipulation, but I would think it shouldn't matter, since I can check the strings just before the failing match call at the bottom...

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

def regexp_compare(regexp_expression, compare_string):
    #final int DOTALL
    #try:    // include try catch for "PatternSyntaxException" while testing/including a new symbol in this method..

    #catch(PatternSyntaxException e):
    #    System.out.println("Regexp>>"+regexp_expression)
    #    e.printStackTrace()
    #*/


    if(not compare_string.strip() and (not regexp_expression.strip() or regexp_expression.strip().lower() == "*".lower()) or (regexp_expression.strip().lower() == ".*".lower())):
        print("return 1")
        return True                

    if(not compare_string or not regexp_expression):
        print("return 2")
        return False                

    regexp_expression = regexp_expression.lower()
    compare_string = compare_string.lower()

    if(not regexp_expression.strip()): 
        regexp_expression = ""

    if(not compare_string.strip() and (not regexp_expression.strip() or regexp_expression.strip().lower() == "*".lower()) or (regexp_expression.strip().lower() == ".*".lower())):
        regexp_expression = ""
    else:

        regexp_expression = regexp_expression.replace("\\","\\\\")
        regexp_expression = regexp_expression.replace("\\.","\\\\.")
        regexp_expression = regexp_expression.replace("\\*", ".*")
        regexp_expression = regexp_expression.replace("\\(", "\\\\(")
        regexp_expression = regexp_expression.replace("\\)", "\\\\)")           
        regexp_expression_arr = regexp_expression.split("|")
        regexp_expression = ""

        for i in range(0, len(regexp_expression_arr)):
            if(not(regexp_expression_arr[i].startswith("^"))):
                regexp_expression_arr[i] = "^"+regexp_expression_arr[i]

            if(not(regexp_expression_arr[i].endswith("$"))):
                regexp_expression_arr[i] = regexp_expression_arr[i]+"$"

            regexp_expression = regexp_expression_arr[i] if regexp_expression == "" else regexp_expression+"|"+regexp_expression_arr[i]  




    result = None        

    print("Regex_expression: [" + regexp_expression+"]")
    print("compare string: [" + compare_string+"]")

    if (re.match(regexp_expression, compare_string)):
        print("result true")
        result = True
    else :
        print("result = false")
        result = False

    print("return result")
    return result
jumps4fun asked Jul 06 '15 11:07

1 Answer

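It's likely that you are comparing a unicode string to a non-unicode (byte) string. On Python 2, a plain "..." literal in a UTF-8 source file is a byte string, while a u"..." literal is a unicode string, so the two represent å differently.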

For example, in the following:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

regexp_expression  = "^log på$"
compare_string = u"log på"

if (re.match(regexp_expression, compare_string)):
    print("result true")
    result = True
else :
    print("result = false")
    result = False

This prints "result = false". So there is likely a point in your string manipulation where one of the two values is not unicode.
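Printing the strings in brackets cannot show this difference, because both kinds of string can look identical on screen. A rough way to check is to print the type and length of each value just before the match. This is only a sketch: describe is a hypothetical helper name, and the å is written as the escape \u00e5 so the snippet does not depend on the source file encoding.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def describe(label, value):
    """Summarize a string by type and length (hypothetical debugging helper)."""
    return "{0}: type={1} len={2}".format(label, type(value).__name__, len(value))


text_value = u"log p\u00e5"               # text string: 6 characters
byte_value = text_value.encode("utf-8")   # UTF-8 bytes: 7 bytes, \u00e5 takes two

print(describe("text", text_value))
print(describe("bytes", byte_value))
```

If the two values passed to re.match report different types, or different lengths for what looks like the same text, one of them is a byte string. The type names differ by version: Python 2 reports unicode and str, Python 3 reports str and bytes.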

The same false result occurs with the following too:

regexp_expression  = u"^log på$"
compare_string = "log på"
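One way to rule the mismatch out is to coerce both values to text before calling re.match. A minimal sketch, assuming any byte strings are UTF-8 encoded; to_text and text_match are hypothetical helper names, not part of the module in the question:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re


def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding it first if it is a byte string."""
    # On Python 2, bytes is an alias for str, so this also catches plain "..." literals.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def text_match(pattern, string):
    """Run re.match after coercing both arguments to text."""
    # re.UNICODE makes \w, \b and similar classes Unicode-aware on Python 2;
    # it is already the default for text patterns on Python 3.
    return re.match(to_text(pattern), to_text(string), re.UNICODE) is not None


pattern = u"^log p\u00e5$".encode("utf-8")  # byte string, like a plain literal on Python 2
candidate = u"log p\u00e5"                  # text string

print(text_match(pattern, candidate))
```

This is a workaround at the comparison point only. If the encoding of the incoming byte strings is not UTF-8, the decode step needs the correct encoding instead.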
Coding Monkey answered Oct 04 '22 23:10
