Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

elegant way to match two wildcarded strings

I'm OCRing some text from two different sources. They can each make mistakes in different places, where they won't recognize a letter/group of letters. If they don't recognize something, it's replaced with a ?. For example, if the word is Roflcopter, one source might return Ro?copter, while another, Roflcop?er. I'd like a function that returns whether two matches might be equivalent, allowing for multiple ?s. Example:

match("Ro?copter", "Roflcop?er") --> True
match("Ro?copter", "Roflcopter") --> True
match("Roflcopter", "Roflcop?er") --> True
match("Ro?co?er", "Roflcop?er") --> True

So far I can match one OCR with a perfect one by using regular expressions:

>>> def match(tn1, tn2):
    tn1re = tn1.replace("?", ".{0,4}")
    tn2re = tn2.replace("?", ".{0,4}")

    return bool(re.match(tn1re, tn2) or re.match(tn2re, tn1))

>>> match("Roflcopter", "Roflcop?er")
True
>>> match("R??lcopter", "Roflcopter")
True

But this doesn't work when they both have ?s in different places:

>>> match("R??lcopter", "Roflcop?er")
False
like image 246
Claudiu Avatar asked Oct 14 '10 18:10

Claudiu


People also ask

What are the 2 commonly used wildcards?

Commonly used wildcards are the asterisk ( * ) and the question mark ( ? ).

What are the two special wildcard characters used in pattern matching?

The wildcard pattern contains the following two special characters: '? ': It represents any single character. '*': It represents 0 or any sequence of characters.

What is wildcard pattern matching?

A wildcard pattern is a series of characters that are matched against incoming character strings. You can use these patterns when you define pattern matching criteria. Matching is done strictly from left to right, one character or basic wildcard pattern at a time.

What is the wildcard that matches one or more occurrences of a pattern in MS word Search?

A wildcard can replace one or more characters in a string of text or numbers. The most common wildcard is the asterisk (*). It's important to note that wildcard searches are case sensitive. Also, Word uses "lazy" pattern matching so it will stop matching as soon as possible.


2 Answers

Well, as long as one ? corresponds to one character, then I can suggest a performant and a compact enough method.

def match(str1, str2):
    if len(str1) != len(str2): return False
    for index, ch1 in enumerate(str1):
        ch2 = str2[index]
        if ch1 == '?' or ch2 == '?': continue
        if ch1 != ch2: return False
    return True

>>> ================================ RESTART ================================
>>> 
>>> match("Roflcopter", "Roflcop?er")
True
>>> match("R??lcopter", "Roflcopter")
True
>>> 
>>> match("R??lcopter", "Roflcop?er")
True
>>> 

Edit: Part B), brain-fart free now.

def sets_match(set1, set2):
    return any(match(str1, str2) for str1 in set1 for str2 in set2)

>>> ================================ RESTART ================================
>>> 
>>> s1 = set(['a?', 'fg'])
>>> s2 = set(['?x'])
>>> sets_match(s1, s2) # a? = x?
True
>>> 
like image 130
Hamish Grubijan Avatar answered Sep 16 '22 11:09

Hamish Grubijan


Thanks to Hamish Grubijan for this idea. Every ? in my ocr'd names can be anywhere from 0 to 3 letters. What I do is expand each string to a list of possible expansions:

>>> list(expQuestions("?flcopt?"))
['flcopt', 'flcopt@', 'flcopt@@', 'flcopt@@@', '@flcopt', '@flcopt@', '@flcopt@@', '@flcopt@@@', '@@flcopt', '@@flcopt@', '@@flcopt@@', '@@flcopt@@@', '@@@flcopt', '@@@flcopt@', '@@@flcopt@@', '@@@flcopt@@@']

then I expand both and use his matching function, which I called matchats:

def matchOCR(l, r):
    for expl in expQuestions(l):
        for expr in expQuestions(r):
            if matchats(expl, expr):
                return True
    return False

Works as desired:

>>> matchOCR("Ro?co?er", "?flcopt?")
True
>>> matchOCR("Ro?co?er", "?flcopt?z")
False
>>> matchOCR("Ro?co?er", "?flc?pt?")
True
>>> matchOCR("Ro?co?e?", "?flc?pt?")
True


The matching function:
def matchats(l, r):
    """Match two strings with @ representing exactly 1 char"""
    if len(l) != len(r): return False
    for i, c1 in enumerate(l):
        c2 = r[i]
        if c1 == "@" or c2 == "@": continue
        if c1 != c2: return False
    return True

and the expanding function, where cartesian_product does just that:

def expQuestions(s):
    """For OCR w/ a questionmark in them, expand questions with
    @s for all possibilities"""
    numqs = s.count("?")

    blah = list(s)
    for expqs in cartesian_product([(0,1,2,3)]*numqs):
        newblah = blah[:]
        qi = 0
        for i,c in enumerate(newblah):
            if newblah[i] == '?':
                newblah[i] = '@'*expqs[qi]
                qi += 1
        yield "".join(newblah)
like image 25
Claudiu Avatar answered Sep 17 '22 11:09

Claudiu