Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find a position of a substring within a string with fuzzy match

I have come across a problem of matching a string in an OCR recognized text and find the position of it considering there can be arbitrary tolerance of wrong, missing or extra characters. The result should be a best match position, possibly (not necessarily) with length of matching substring.

For example:

String: 9912, 1.What is your name?
Substring: 1. What is your name?
Tolerance: 1
Result: match on character 7

String: Where is our caat if any?
Substring: your cat
Tolerance: 2
Result: match on character 10

String: Tolerance is t0o h1gh.
Substring: Tolerance is too high;
Tolerance: 1
Result: no match

I have tried to adapt Levenstein algorithm, but it doesn't work properly for substrings and doesn't return position.

Algorithm in Delphi would be preferred, yet any implementation or pseudo logic would do.

like image 900
too Avatar asked Dec 07 '10 10:12

too


People also ask

How do you evaluate a fuzzy match?

One of the most effective ways to calculate scores for a fuzzy string matching algorithm is by using cosine similarity. The cosine similarity between two non-zero vectors is simply the cosine of the angle between these vectors.

What is Fuzzy Wuzzy matching?

FuzzyWuzzy is a library of Python which is used for string matching. Fuzzy string matching is the process of finding strings that match a given pattern. Basically it uses Levenshtein Distance to calculate the differences between sequences.

How do you do a fuzzy search?

A fuzzy search searches for text that matches a term closely instead of exactly. Fuzzy searches help you find relevant results even when the search terms are misspelled. To perform a fuzzy search, append a tilde (~) at the end of the search term.

What is the difference between exact matching and fuzzy matching?

You can use the exact matching method for almost any field, including custom fields. The fuzzy matching methods look for strings that approximately match a pattern. Some fuzzy matching methods, such as Acronym and Name Variant, identify similarities using hard-coded dictionaries.


1 Answers

Here's a recursive implementation that works, but might not be fast enough. The worst case scenario is when a match can't be found, and all but the last char in "What" gets matched at every index in Where. In that case the algorithm will make Length(What)-1 + Tolerance comparasions for each char in Where, plus one recursive call per Tolerance. Since both Tolerance and the length of What are constnats, I'd say the algorithm is O(n). It's performance will degrade linearly with the length of both "What" and "Where".

function BrouteFindFirst(What, Where:string; Tolerance:Integer; out AtIndex, OfLength:Integer):Boolean;
  var i:Integer;
      aLen:Integer;
      WhatLen, WhereLen:Integer;

    function BrouteCompare(wherePos, whatPos, Tolerance:Integer; out Len:Integer):Boolean;
    var aLen:Integer;
        aRecursiveLen:Integer;
    begin
      // Skip perfect match characters
      aLen := 0;
      while (whatPos <= WhatLen) and (wherePos <= WhereLen) and (What[whatPos] = Where[wherePos]) do
      begin
        Inc(aLen);
        Inc(wherePos);
        Inc(whatPos);
      end;
      // Did we find a match?
      if (whatPos > WhatLen) then
        begin
          Result := True;
          Len := aLen;
        end
      else if Tolerance = 0 then
        Result := False // No match and no more "wild cards"
      else
        begin
          // We'll make an recursive call to BrouteCompare, allowing for some tolerance in the string
          // matching algorithm.
          Dec(Tolerance); // use up one "wildcard"
          Inc(whatPos); // consider the current char matched
          if BrouteCompare(wherePos, whatPos, Tolerance, aRecursiveLen) then
            begin
              Len := aLen + aRecursiveLen;
              Result := True;
            end
          else if BrouteCompare(wherePos + 1, whatPos, Tolerance, aRecursiveLen) then
            begin
              Len := aLen + aRecursiveLen;
              Result := True;
            end
          else
            Result := False; // no luck!
        end;
    end;

  begin

    WhatLen := Length(What);
    WhereLen := Length(Where);

    for i:=1 to Length(Where) do
    begin
      if BrouteCompare(i, 1, Tolerance, aLen) then
      begin
        AtIndex := i;
        OfLength := aLen;
        Result := True;
        Exit;
      end;
    end;

    // No match found!
    Result := False;

  end;

I've used the following code to test the function:

procedure TForm18.Button1Click(Sender: TObject);
var AtIndex, OfLength:Integer;
begin
  if BrouteFindFirst(Edit2.Text, Edit1.Text, ComboBox1.ItemIndex, AtIndex, OfLength) then
    Label3.Caption := 'Found @' + IntToStr(AtIndex) + ', of length ' + IntToStr(OfLength)
  else
    Label3.Caption := 'Not found';
end;

For case:

String: Where is our caat if any?
Substring: your cat
Tolerance: 2
Result: match on character 10

it shows a match on character 9, of length 6. For the other two examples it gives the expected result.

like image 200
Cosmin Prund Avatar answered Sep 19 '22 05:09

Cosmin Prund