Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I do fuzzy substring matching in Ruby?

I found lots of links about fuzzy matching, comparing one string to another and seeing which gets the highest similarity score.

I have one very long string, which is a document, and a substring. The substring came from the original document, but has been converted several times, so weird artifacts might have been introduced, such as a space here, a dash there. The substring will match a section of the text in the original document 99% or more. I am not matching to see from which document this string is, I am trying to find the index in the document where the string starts.

If the string was identical because no random error was introduced, I would use document.index(substring), however this fails if there is even one character difference.

I thought the difference would be accounted for by removing all characters except a-z in both the string and the substring, compare, and then use the index I generated when compressing the string to translate the index in the compressed string to the index in the real document. This worked well where the difference was whitespace and punctuation, but as soon as one letter is different it failed.

The document is typically a few pages to a hundred pages, and the substring from a few sentences to a few pages.

like image 534
Stian Håklev Avatar asked May 23 '11 06:05

Stian Håklev


People also ask

What is fuzzy matching example?

Fuzzy Matching (also called Approximate String Matching) is a technique that helps identify two elements of text, strings, or entries that are approximately similar but are not exactly the same. For example, let's take the case of hotels listing in New York as shown by Expedia and Priceline in the graphic below.

What is fuzzy data matching?

Fuzzy matching (FM), also known as fuzzy logic, approximate string matching, fuzzy name matching, or fuzzy string matching is an artificial intelligence and machine learning technology that identifies similar, but not identical elements in data table sets.

What is fuzzy RegEx?

Fuzzy RegEx (also referred to as "fuzzy matching" or "fuzzy mode" or even just "fuzzy") allows regular expression patterns to match text within a set percentage of similarity. This can allow Grooper users to overcome unpredictable OCR errors when extracting data from documents.

How long is fuzzy matching?

From 3.7 hours to 0.2 seconds. How to perform intelligent string matching in a way that can scale to even the biggest data sets. Same but different. Fuzzy matching of data is an essential first-step for a huge range of data science workflows.


2 Answers

You could try amatch. It's available as a ruby gem and, although I haven't worked with fuzzy logic for a long time, it looks to have what you need. The homepage for amatch is: http://flori.github.com/amatch/.

Just bored and messing around with the idea, a completely non-optimized and untested hack of a solution follows:

include 'amatch'

module FuzzyFinder
  def scanner( input )
    out = [] unless block_given?
    pos = 0
    input.scan(/(\w+)(\W*)/) do |word, white|
      startpos = pos
      pos = word.length + white.length
      if block_given?
        yield startpos, word
      else
        out << [startpos, word]
      end
    end
  end

  def find( text, doc )
    index = scanner(doc)
    sstr = text.gsub(/\W/,'')
    levenshtein = Amatch::Levensthtein.new(sstr)
    minlen = sstr.length
    maxndx = index.length
    possibles = []
    minscore = minlen*2
    index.each_with_index do |x, i|
      spos = x[0]
      str = x[1]
      si = i
      while (str.length < minlen)
        i += 1
        break unless i < maxndx
        str += index[i][1]
      end
      str = str.slice(0,minlen) if (str.length > minlen)
      score = levenshtein.search(str)
      if score < minscore
        possibles = [spos]
        minscore = score
      elsif score == minscore
        possibles << spos
      end
    end
    [minscore, possibles]
  end
end

Obviously there are numerous improvements possible and probably necessary! A few off the top:

  1. Process the document once and store the results, possibly in a database.
  2. Determine a usable length of string for an initial check, process against that initial substring first before trying to match the entire fragment.
  3. Following up on the previous, precalculate starting fragments of that length.
like image 158
Brian61 Avatar answered Sep 18 '22 18:09

Brian61


A simple one is fuzzy_match

require 'fuzzy_match'
FuzzyMatch.new(['seamus', 'andy', 'ben']).find('Shamus') #=> seamus

A more elaborated one (you wouldn't say it from this example though) is levenshein, which computes the number of differences.

require 'levenshtein' 
Levenshtein.distance('test', 'test')    # => 0
Levenshtein.distance('test', 'tent')    # => 1
like image 41
peter Avatar answered Sep 18 '22 18:09

peter