Efficient way (in Ruby), to determine largest matching sequence in an array/string?

Question

Lets's say I have two arrays of words:

array1 = ["hello", "world", "i", "am", "in", "the", "world"]
array2 = ["This", "is", "the", "hello", "world", "message"]

which, could just as easily be represented by two strings:

string1 = "hello world i am in the world"
string2 = "This is the hello world message"

Lets assume I'm going with arrays for now.

I want to find the largest sub-array of array2 which appears in the same order in array1.

So, if you were going to do it in the slowest way imaginable, say, you would say:

get all the 6-word subarrays (of which there is one) from array2.
- does it appear in array1, in that order? NO
get all the 5-word subarrays (of which there are two) from array2.
- do either of them appear in array1, in that order? NO
get all the 4-word subarrays from array2.
- do any of them appear in array1, in that order? NO
etc, until we get to
get all the 2-word subarrays from array2.
- do any of them appear in array1, in that order? YES: ["hello", "world"] does. STOP.

But, this feels like it would be quite inefficient. Can anyone see a better method? I'm using Ruby, but i'd be interested in a general algorithm as well as how to do it in that specific language.

Note that this isn't just array intersection because that (at least in ruby) doesn't care about the order of elements, which I do care about.

thanks!

SRack · Accepted Answer

Here's a fast working solution, cutting down the comparison to only those element in common across arrays:

array1 = ["hello", "world", "i", "am", "in", "the", "world"]
array2 = ["This", "is", "the", "hello", "world", "message"]

common_words = array1 & array2

stringified_array1 = array1.join(' ')
stringified_array2 = array2.join(' ')

(common_words.length - 1).downto(0).map do |n|
  stringified_combo = array1[0..n].join(' ')

  if stringified_array1.include?(stringified_combo) && stringified_array2.include?(stringified_combo)
    stringified_combo.split($,)
  end 
end.compact.max

This you get the words in common between the two arrays, and tests these from largest to smallest. You check they're in order in the first array, then if they exist in the second.

I believe this holds up and is pretty efficient, though happy to receive any comments and feedback,

Dustin Harmon · Answer

This seems to be a take on the "longest common substring" problem, but using words instead of characters in strings.

This wiki (https://en.wikipedia.org/wiki/Longest_common_substring_problem) outlines the dynamic programming approach to locate the largest common match in pseudo code and could be ported over to ruby passing the two arrays as the params.

function LCSubstr(S[1..r], T[1..n])
L := array(1..r, 1..n)
z := 0
ret := {}
for i := 1..r
    for j := 1..n
        if S[i] == T[j]
            if i == 1 or j == 1
                L[i,j] := 1
            else
                L[i,j] := L[i-1,j-1] + 1
            if L[i,j] > z
                z := L[i,j]
                ret := {S[i-z+1..i]}
            else
            if L[i,j] == z
                ret := ret ∪ {S[i-z+1..i]}
        else
            L[i,j] := 0
return ret

Tony Hopkinson · Answer

Do the "six word" test straight off Then I'd the go through each word in the second array and test to see if it's in the 1st. If it is then look for it and the one after, if both then look for the next.

ie once you've discovered "This" is not present in the first array, you've also discarded the five other potential candidates starting with this.

Cary Swoveland · Answer

Until I learned about the "longest common substring/sequence problem" (see @Dustin's answer) I did not think there is an approach that is better than the one you outlined in the question: start with the largest possible subarray (array2 itself), then sequentially reduce the size of subarrays by one until a match is found (or it is determined that the two arrays contain no common element). Though I now see there is more efficient way, your idea is certainly not a bad one, particularly if the substrings are not too large, and it is easier to implement than the dynamic programming solution Dustin referenced. I've implemented your idea below.

I have chosen to use a regular expression to identify matches, so I need to convert array1 to a string.

str1 = array1.join(' ')
  #=> "hello world i am in the world"

The calculation is as follows.

[array1.size, array2.size].min.downto(1).each do |n|
  a = array2.each_cons(n).find { |a| str1.match?(/\b#{a.join(' ')}\b/) }
  break a unless a.nil? 
end
  #=> ["hello", "world"]

nil is returned if array1 and array2 have no common element. If desired one could first test (array1 & array2).empty?.

Here is a possible improvement to what I have above. The idea is to attempt to reduce m in m.downto(1).

h1 = array1.each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
  #=> {"hello"=>1, "world"=>2, "i"=>1, "am"=>1, "in"=>1, "the"=>1} 
h2 = array1.each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
  #=> {"hello"=>1, "world"=>2, "i"=>1, "am"=>1, "in"=>1, "the"=>1} 
m = [array1.size, array2.size, h2.sum { |k,v| [v, h2[k]].min }].min
  #=> [7, 6, 7].min
  #=> 6

This does not help here, but it might if the arrays array1 and array2 were different.

seph · Answer

This was part of an implementation in Ruby of similar_text from PHP. Using strings:

def substrings(str)
  (0...str.size).flat_map do |i|
    (i...str.size).map { |j| str[i..j] }
  end
end

def lcs(str1, str2)
  (substrings(str1) & substrings(str2)).max_by(&:size)
end

puts "'#{lcs("hello world i am in the world", "This is the hello world message")}'"

=> 'hello world '

Brute force for the substrings might be a candidate for a Rust FFI call? We weren't doing very big compares so it worked for us.

Efficient way (in Ruby), to determine largest matching sequence in an array/string?

Tags:

arrays

algorithm

ruby

match

Max Williams

5 Answers

SRack

Dustin Harmon

Tony Hopkinson

Cary Swoveland

seph

Recent Activity

Donate For Us

Efficient way (in Ruby), to determine largest matching sequence in an array/string?

Tags:

arrays

algorithm

ruby

match

Max Williams

5 Answers

SRack

Dustin Harmon

Tony Hopkinson

Cary Swoveland

seph

Related questions

Recent Activity

Donate For Us