Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient way (in Ruby), to determine largest matching sequence in an array/string?

Lets's say I have two arrays of words:

array1 = ["hello", "world", "i", "am", "in", "the", "world"]
array2 = ["This", "is", "the", "hello", "world", "message"]

which, could just as easily be represented by two strings:

string1 = "hello world i am in the world"
string2 = "This is the hello world message"

Lets assume I'm going with arrays for now.

I want to find the largest sub-array of array2 which appears in the same order in array1.

So, if you were going to do it in the slowest way imaginable, say, you would say:

  • get all the 6-word subarrays (of which there is one) from array2.
    • does it appear in array1, in that order? NO
  • get all the 5-word subarrays (of which there are two) from array2.
    • do either of them appear in array1, in that order? NO
  • get all the 4-word subarrays from array2.
    • do any of them appear in array1, in that order? NO
  • etc, until we get to
  • get all the 2-word subarrays from array2.
    • do any of them appear in array1, in that order? YES: ["hello", "world"] does. STOP.

But, this feels like it would be quite inefficient. Can anyone see a better method? I'm using Ruby, but i'd be interested in a general algorithm as well as how to do it in that specific language.

Note that this isn't just array intersection because that (at least in ruby) doesn't care about the order of elements, which I do care about.

thanks!

like image 435
Max Williams Avatar asked Mar 13 '19 16:03

Max Williams


5 Answers

Here's a fast working solution, cutting down the comparison to only those element in common across arrays:

array1 = ["hello", "world", "i", "am", "in", "the", "world"]
array2 = ["This", "is", "the", "hello", "world", "message"]

common_words = array1 & array2

stringified_array1 = array1.join(' ')
stringified_array2 = array2.join(' ')

(common_words.length - 1).downto(0).map do |n|
  stringified_combo = array1[0..n].join(' ')

  if stringified_array1.include?(stringified_combo) && stringified_array2.include?(stringified_combo)
    stringified_combo.split($,)
  end 
end.compact.max

This you get the words in common between the two arrays, and tests these from largest to smallest. You check they're in order in the first array, then if they exist in the second.

I believe this holds up and is pretty efficient, though happy to receive any comments and feedback,

like image 194
SRack Avatar answered Oct 06 '22 07:10

SRack


This seems to be a take on the "longest common substring" problem, but using words instead of characters in strings.

This wiki (https://en.wikipedia.org/wiki/Longest_common_substring_problem) outlines the dynamic programming approach to locate the largest common match in pseudo code and could be ported over to ruby passing the two arrays as the params.

function LCSubstr(S[1..r], T[1..n])
L := array(1..r, 1..n)
z := 0
ret := {}
for i := 1..r
    for j := 1..n
        if S[i] == T[j]
            if i == 1 or j == 1
                L[i,j] := 1
            else
                L[i,j] := L[i-1,j-1] + 1
            if L[i,j] > z
                z := L[i,j]
                ret := {S[i-z+1..i]}
            else
            if L[i,j] == z
                ret := ret ∪ {S[i-z+1..i]}
        else
            L[i,j] := 0
return ret
like image 23
Dustin Harmon Avatar answered Oct 06 '22 05:10

Dustin Harmon


Do the "six word" test straight off Then I'd the go through each word in the second array and test to see if it's in the 1st. If it is then look for it and the one after, if both then look for the next.

ie once you've discovered "This" is not present in the first array, you've also discarded the five other potential candidates starting with this.

like image 45
Tony Hopkinson Avatar answered Oct 06 '22 07:10

Tony Hopkinson


Until I learned about the "longest common substring/sequence problem" (see @Dustin's answer) I did not think there is an approach that is better than the one you outlined in the question: start with the largest possible subarray (array2 itself), then sequentially reduce the size of subarrays by one until a match is found (or it is determined that the two arrays contain no common element). Though I now see there is more efficient way, your idea is certainly not a bad one, particularly if the substrings are not too large, and it is easier to implement than the dynamic programming solution Dustin referenced. I've implemented your idea below.

I have chosen to use a regular expression to identify matches, so I need to convert array1 to a string.

str1 = array1.join(' ')
  #=> "hello world i am in the world" 

The calculation is as follows.

[array1.size, array2.size].min.downto(1).each do |n|
  a = array2.each_cons(n).find { |a| str1.match?(/\b#{a.join(' ')}\b/) }
  break a unless a.nil? 
end
  #=> ["hello", "world"] 

nil is returned if array1 and array2 have no common element. If desired one could first test (array1 & array2).empty?.

Here is a possible improvement to what I have above. The idea is to attempt to reduce m in m.downto(1).

h1 = array1.each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
  #=> {"hello"=>1, "world"=>2, "i"=>1, "am"=>1, "in"=>1, "the"=>1} 
h2 = array1.each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
  #=> {"hello"=>1, "world"=>2, "i"=>1, "am"=>1, "in"=>1, "the"=>1} 
m = [array1.size, array2.size, h2.sum { |k,v| [v, h2[k]].min }].min
  #=> [7, 6, 7].min
  #=> 6

This does not help here, but it might if the arrays array1 and array2 were different.

like image 36
Cary Swoveland Avatar answered Oct 06 '22 07:10

Cary Swoveland


This was part of an implementation in Ruby of similar_text from PHP. Using strings:

def substrings(str)
  (0...str.size).flat_map do |i|
    (i...str.size).map { |j| str[i..j] }
  end
end

def lcs(str1, str2)
  (substrings(str1) & substrings(str2)).max_by(&:size)
end

puts "'#{lcs("hello world i am in the world", "This is the hello world message")}'"

=> 'hello world '

Brute force for the substrings might be a candidate for a Rust FFI call? We weren't doing very big compares so it worked for us.

like image 36
seph Avatar answered Oct 06 '22 06:10

seph