Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find consecutive substring indexes

Given a search string and a result string (which is guaranteed to contain all letters of the search string, case-insensitive, in order), how can I most efficiently get an array of ranges representing the indices in the result string corresponding to the letters in the search string?

Desired output:

substrings( "word", "Microsoft Office Word 2007" )
#=> [ 17..20 ]

substrings( "word", "Network Setup Wizard" )
#=> [ 3..5, 19..19 ]
#=> [ 3..4, 18..19 ]   # Alternative, acceptable, less-desirable output

substrings( "word", "Watch Network Daemon" )
#=> [ 0..0, 10..11, 14..14 ]

This is for an autocomplete search box. Here's a screenshot from a tool similar to Quicksilver that underlines letters as I'm looking to do. Note that--unlike my ideal output above--this screenshot does not prefer longer single matches.
Screenshot of Colibri underlining letters in search results

Benchmark Results

Benchmarking the current working results shows that @tokland's regex-based answer is basically as fast as the StringScanner-based solutions I put forth, with less code:

               user     system      total        real
phrogz1    0.889000   0.062000   0.951000 (  0.944000)
phrogz2    0.920000   0.047000   0.967000 (  0.977000)
tokland    1.030000   0.000000   1.030000 (  1.035000)

Here is the benchmark test:

a=["Microsoft Office Word 2007","Network Setup Wizard","Watch Network Daemon"]
b=["FooBar","Foo Bar","For the Love of Big Cars"]
test = { a=>%w[ w wo wor word ], b=>%w[ f fo foo foobar fb fbr ] }
require 'benchmark'
Benchmark.bmbm do |x|
  %w[ phrogz1 phrogz2 tokland ].each{ |method|
    x.report(method){ test.each{ |words,terms|
      words.each{ |master| terms.each{ |term|
        2000.times{ send(method,term,master) }
      } }
    } }
  }
end
like image 263
Phrogz Avatar asked Apr 19 '11 15:04

Phrogz


3 Answers

To have something to start with, how about that?

>> s = "word"
>> re = /#{s.chars.map{|c| "(#{c})" }.join(".*?")}/i # /(w).*?(o).*?(r).*?(d)/i/
>> match = "Watch Network Daemon".match(re)
=> #<MatchData "Watch Network D" 1:"W" 2:"o" 3:"r" 4:"D">
>> 1.upto(s.length).map { |idx| match.begin(idx) }
=> [0, 10, 11, 14]

And now you only have to build the ranges (if you really need them, I guess the individual indexes are also ok).

like image 103
tokland Avatar answered Nov 14 '22 08:11

tokland


Ruby's Abbrev module is a good starting point. It breaks down a string into a hash consisting of the unique keys that can identify the full word:

require 'abbrev'
require 'pp'

abbr = Abbrev::abbrev(['ruby'])
>> {"rub"=>"ruby", "ru"=>"ruby", "r"=>"ruby", "ruby"=>"ruby"}

For every keypress you can do a lookup and see if there's a match. I'd filter out all keys shorter than a certain length, to reduce the size of the hash.

The keys will also give you a quick set of words to look up the subword matches in your original string.

For fast lookups to see if there's a substring match:

regexps = Regexp.union(
  abbr.keys.sort.reverse.map{ |k|
    Regexp.new(
      Regexp.escape(k),
      Regexp::IGNORECASE
    )
  }
)

Note that it's escaping the patterns, which would allow characters to be entered, such as ?, * or ., and be treated as literals, instead of special characters for regex, like they would normally be treated.

The result looks like:

/(?i-mx:ruby)|(?i-mx:rub)|(?i-mx:ru)|(?i-mx:r)/

Regexp's match will return information about what was found.

Because the union "ORs" the patterns, it will only find the first match, which will be the shortest occurrence in the string. To fix that reverse the sort.

That should give you a good start on what you want to do.


EDIT: Here's some code to directly answer the question. We've been busy at work so it's taken a couple days to get back this:

require 'abbrev'
require 'pp'

abbr = Abbrev::abbrev(['ruby'])
regexps = Regexp.union( abbr.keys.sort.reverse.map{ |k| Regexp.new( Regexp.escape(k), Regexp::IGNORECASE ) } )

target_str ='Ruby rocks, rub-a-dub-dub, RU there?'
str_offset = 0
offsets = []
loop do
  match_results = regexps.match(target_str, str_offset)
  break if (match_results.nil?)
  s, e = match_results.offset(0)
  offsets << [s, e - s]
  str_offset = 1 + s
end

pp offsets

>> [[0, 4], [5, 1], [12, 3], [27, 2], [33, 1]]

If you want ranges replace offsets << [s, e - s] with offsets << [s .. e] which will return:

>> [[0..4], [5..6], [12..15], [27..29], [33..34]]
like image 34
the Tin Man Avatar answered Nov 14 '22 08:11

the Tin Man


Here's a late entrant that's making a move as it nears the finish line.

code

def substrings( search_str, result_str )
  search_chars = search_str.downcase.chars
  next_char = search_chars.shift
  result_str.downcase.each_char.with_index.take_while.with_object([]) do |(c,i),a|
    if next_char == c
      (a.empty? || i != a.last.last+1) ? a << (i..i) : a[-1]=(a.last.first..i)
      next_char = search_chars.shift
    end   
    next_char
  end
end

demo

substrings( "word", "Microsoft Office Word 2007" ) #=> [17..20]
substrings( "word", "Network Setup Wizard" )       #=> [3..5, 19..19]
substrings( "word", "Watch Network Daemon" )       #=> [0..0, 10..11, 14..14]

benchmark

              user     system      total        real
phrogz1   1.120000   0.000000   1.120000 (  1.123083)
cary      0.550000   0.000000   0.550000 (  0.550728)
like image 29
Cary Swoveland Avatar answered Nov 14 '22 08:11

Cary Swoveland