I'm looking for a library or a method using existing libraries (difflib, fuzzywuzzy, python-levenshtein) to find the closest match of a string (query) in a text (corpus).
I've developed a method based on difflib, where I split my corpus into ngrams of size n (the length of query).
import difflib
from nltk.util import ngrams

def get_best_match(query, corpus):
    # Split the corpus into character ngrams of the same length as the query
    ngs = ngrams(list(corpus), len(query))
    ngrams_text = [''.join(x) for x in ngs]
    # Return the single closest ngram (no similarity cutoff)
    return difflib.get_close_matches(query, ngrams_text, n=1, cutoff=0)[0]
It works as I want when the difference between the query and the matched string consists only of character substitutions:
query = "ipsum dolor"
corpus = "lorem 1psum d0l0r sit amet"
match = get_best_match(query, corpus)
# match = "1psum d0l0r"
But when the difference involves character deletions, it does not:
query = "ipsum dolor"
corpus = "lorem 1psum dlr sit amet"
match = get_best_match(query, corpus)
# match = "psum dlr si"
# expected_match = "1psum dlr"
Is there a way to get a more flexible result size (as for expected_match)?
The workaround I use now is to extend the ngrams with (n-k)-grams for k = {1, 2, 3}, to account for up to 3 deletions. It is much better than the first version, but not efficient in terms of speed, as there are more than 3 times as many ngrams to check. It is also not a generalizable solution.
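For illustration, a minimal sketch of that workaround (the function name and the max_deletions parameter are mine, not from the post above):

import difflib
from nltk.util import ngrams

def get_best_match_nk(query, corpus, max_deletions=3):
    # Generate ngrams of sizes len(query) down to len(query) - max_deletions,
    # so candidates may be shorter than the query by up to max_deletions chars.
    candidates = []
    for k in range(max_deletions + 1):
        n = len(query) - k
        if n < 1:
            break
        candidates.extend(''.join(g) for g in ngrams(list(corpus), n))
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=0)
    return matches[0] if matches else None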
This function finds the best matching substring of variable length.
The implementation considers the corpus as one long string, hence avoiding your concerns with spaces and unseparated words.
Code summary:
1. Scan the corpus for match values in steps of size step to find the approximate location of the highest match value, pos.
2. Find the substring in the vicinity of pos with the highest match value, by adjusting the left/right positions of the substring.
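For reference, the match value used in both steps is difflib's SequenceMatcher.ratio(), which the _match() helper below wraps:

from difflib import SequenceMatcher

# ratio() returns 2*M/T, where M is the number of matching characters
# and T is the combined length of both strings; 1.0 is a perfect match.
print(SequenceMatcher(None, "ipsum dolor", "1psum d0l0r").ratio())  # ~0.727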
from difflib import SequenceMatcher

def get_best_match(query, corpus, step=4, flex=3, case_sensitive=False, verbose=False):
    """Return best matching substring of corpus.

    Parameters
    ----------
    query : str
    corpus : str
    step : int
        Step size of first match-value scan through corpus. Can be thought of
        as a sort of "scan resolution". Should not exceed length of query.
    flex : int
        Max. left/right substring position adjustment value. Should not
        exceed length of query / 2.

    Outputs
    -------
    output0 : str
        Best matching substring.
    output1 : float
        Match ratio of best matching substring. 1 is perfect match.
    """

    def _match(a, b):
        """Compact alias for SequenceMatcher."""
        return SequenceMatcher(None, a, b).ratio()

    def scan_corpus(step):
        """Return list of match values from corpus-wide scan."""
        match_values = []

        m = 0
        while m + qlen - step <= len(corpus):
            match_values.append(_match(query, corpus[m : m + qlen]))
            if verbose:
                print(query, "-", corpus[m : m + qlen], _match(query, corpus[m : m + qlen]))
            m += step

        return match_values

    def index_max(v):
        """Return index of max value."""
        return max(range(len(v)), key=v.__getitem__)

    def adjust_left_right_positions():
        """Return left/right positions for best string match."""
        # bp_* is synonym for 'Best Position Left/Right' and are adjusted
        # to optimize bmv_*
        p_l, bp_l = [pos] * 2
        p_r, bp_r = [pos + qlen] * 2

        # bmv_* are declared here in case they are untouched in optimization
        bmv_l = match_values[p_l // step]
        bmv_r = match_values[p_l // step]

        for f in range(flex):
            ll = _match(query, corpus[p_l - f : p_r])
            if ll > bmv_l:
                bmv_l = ll
                bp_l = p_l - f

            lr = _match(query, corpus[p_l + f : p_r])
            if lr > bmv_l:
                bmv_l = lr
                bp_l = p_l + f

            rl = _match(query, corpus[p_l : p_r - f])
            if rl > bmv_r:
                bmv_r = rl
                bp_r = p_r - f

            rr = _match(query, corpus[p_l : p_r + f])
            if rr > bmv_r:
                bmv_r = rr
                bp_r = p_r + f

            if verbose:
                print("\n" + str(f))
                print("ll: -- value: %f -- snippet: %s" % (ll, corpus[p_l - f : p_r]))
                print("lr: -- value: %f -- snippet: %s" % (lr, corpus[p_l + f : p_r]))
                print("rl: -- value: %f -- snippet: %s" % (rl, corpus[p_l : p_r - f]))
                print("rr: -- value: %f -- snippet: %s" % (rr, corpus[p_l : p_r + f]))

        return bp_l, bp_r, _match(query, corpus[bp_l : bp_r])

    if not case_sensitive:
        query = query.lower()
        corpus = corpus.lower()

    qlen = len(query)

    if flex >= qlen / 2:
        print("Warning: flex exceeds length of query / 2. Setting to default.")
        flex = 3

    match_values = scan_corpus(step)
    pos = index_max(match_values) * step

    pos_left, pos_right, match_value = adjust_left_right_positions()

    return corpus[pos_left : pos_right].strip(), match_value
Example:
query = "ipsum dolor"
corpus = "lorem i psum d0l0r sit amet"
match = get_best_match(query, corpus, step=2, flex=4)
print(match)
('i psum d0l0r', 0.782608695652174)
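That ratio is 2*M/T as noted above: M = 9 characters match between the 11-character query and the 12-character result, so 2*9/23 ≈ 0.783.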
Some good heuristic advice is to always keep step < len(query) * 3/4 and flex < len(query) / 3. I also added case sensitivity, in case that's important. It works quite well when you start playing with the step and flex values. Small step values give better results but take longer to compute. flex governs how flexible the length of the resulting substring is allowed to be.
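For example, a finer scan on the deletion case from the question (these parameter values are just an illustration):

query = "ipsum dolor"
corpus = "lorem 1psum dlr sit amet"
# Smaller step = finer (slower) scan; flex bounds how far the substring
# boundaries may move relative to the initial window of len(query).
match = get_best_match(query, corpus, step=2, flex=3)
print(match)  # best substring and its match ratio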
Important to note: This will only find the first best match, so if there are multiple equally good matches, only the first will be returned. To allow for multiple matches, change index_max() to return a list of indices for the n highest values of the input list, and loop over adjust_left_right_positions() for values in that list.
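A minimal sketch of that change (index_max_n is a hypothetical name; the commented loop assumes it runs inside get_best_match, where pos, step, and match_values are in scope):

import heapq

def index_max_n(v, n=3):
    # Hypothetical replacement for index_max(): indices of the n highest values.
    return heapq.nlargest(n, range(len(v)), key=v.__getitem__)

# Inside get_best_match, instead of taking a single pos:
# results = []
# for i in index_max_n(match_values, n=3):
#     pos = i * step  # adjust_left_right_positions() reads pos from this scope
#     results.append(adjust_left_right_positions())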