I found lots of links about fuzzy matching, comparing one string to another and seeing which gets the highest similarity score.
I have one very long string, which is a document, and a substring. The substring came from the original document, but it has been converted several times, so odd artifacts may have been introduced: a space here, a dash there. The substring will match a section of the text in the original document at 99% similarity or more. I am not trying to work out which document the string came from; I am trying to find the index in the document where the substring starts.
If the string were identical (that is, if no random errors had been introduced), I would simply use document.index(substring); however, this fails if there is even a one-character difference.
I thought I could account for the differences by removing all characters except a-z from both the document and the substring, comparing those, and then using a mapping built while compressing the document to translate an index in the compressed string back to an index in the real document. This worked well where the differences were whitespace and punctuation, but it failed as soon as a single letter differed.
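Here is roughly what that attempt looked like (compress_with_map is just a stand-in name for my helper, and document/substring stand in for the actual strings):

# Keep only letters, remembering where each kept character came from.
def compress_with_map(str)
  compressed = ''
  map = []  # map[i] = offset of compressed[i] in the original string
  str.each_char.with_index do |ch, i|
    if ch =~ /[a-z]/i
      compressed << ch.downcase
      map << i
    end
  end
  [compressed, map]
end

doc_c, doc_map = compress_with_map(document)
sub_c, _ = compress_with_map(substring)
if (start = doc_c.index(sub_c))
  real_start = doc_map[start]  # index in the real document
end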
The document is typically a few pages to a hundred pages, and the substring from a few sentences to a few pages.
You could try amatch. It's available as a Ruby gem and, although I haven't worked with fuzzy logic for a long time, it looks like it has what you need. The homepage for amatch is: http://flori.github.com/amatch/.
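Its Amatch::Levenshtein class can both compare a pattern against a whole string and search for the pattern's best approximate occurrence inside a longer string, roughly like this:

require 'amatch'

pattern = Amatch::Levenshtein.new('fuzzy')
pattern.match('fuzy')           # => 1  edit distance to the whole string
pattern.search('a fuzy match')  # => 1  best edit distance of the pattern anywhere inside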
I was just bored and messing around with the idea; a completely non-optimized and untested hack of a solution follows:
require 'amatch'

module FuzzyFinder
  # Splits the document into words and records the offset at which each
  # word starts in the original text. Yields (offset, word) pairs if a
  # block is given; otherwise returns an array of [offset, word] pairs.
  def scanner(input)
    out = [] unless block_given?
    pos = 0
    input.scan(/(\w+)(\W*)/) do |word, white|
      startpos = pos
      pos += word.length + white.length  # advance past the word and trailing non-word chars
      if block_given?
        yield startpos, word
      else
        out << [startpos, word]
      end
    end
    out
  end

  # Returns [minscore, possibles]: the lowest Levenshtein score found
  # and every document offset that achieved it.
  def find(text, doc)
    index = scanner(doc)
    sstr = text.gsub(/\W/, '')  # strip non-word characters from the needle
    levenshtein = Amatch::Levenshtein.new(sstr)
    minlen = sstr.length
    maxndx = index.length
    possibles = []
    minscore = minlen * 2
    index.each_with_index do |(spos, str), i|
      # Grow the window by appending following words until it is long enough.
      j = i
      while str.length < minlen
        j += 1
        break unless j < maxndx
        str += index[j][1]
      end
      str = str.slice(0, minlen) if str.length > minlen
      score = levenshtein.search(str)
      if score < minscore
        possibles = [spos]
        minscore = score
      elsif score == minscore
        possibles << spos
      end
    end
    [minscore, possibles]
  end
end
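Used roughly like this (document and substring stand in for your actual strings):

include FuzzyFinder

score, offsets = find(substring, document)
# score is the lowest edit distance seen; offsets lists every
# document position that achieved it.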
Obviously there are numerous improvements possible, and probably necessary!
A simple one is the fuzzy_match gem:
require 'fuzzy_match'
FuzzyMatch.new(['seamus', 'andy', 'ben']).find('Shamus') #=> seamus
A more elaborate one (though you wouldn't guess it from this example) is levenshtein, which computes the number of single-character differences (the edit distance):
require 'levenshtein'
Levenshtein.distance('test', 'test') # => 0
Levenshtein.distance('test', 'tent') # => 1
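To actually locate the substring in a document with it, one brute-force option is a sliding window: compare the substring against every window of the same length and keep the start index with the smallest distance. A minimal sketch (best_match_index is a made-up helper, and this is far too slow for the hundred-page documents in the question, but it shows the idea):

require 'levenshtein'

def best_match_index(document, substring)
  best_index = nil
  best_distance = Float::INFINITY
  (0..document.length - substring.length).each do |i|
    d = Levenshtein.distance(document[i, substring.length], substring)
    if d < best_distance
      best_distance = d
      best_index = i
    end
  end
  [best_index, best_distance]
end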