Lets's say I have two arrays of words:
array1 = ["hello", "world", "i", "am", "in", "the", "world"]
array2 = ["This", "is", "the", "hello", "world", "message"]
which, could just as easily be represented by two strings:
string1 = "hello world i am in the world"
string2 = "This is the hello world message"
Lets assume I'm going with arrays for now.
I want to find the largest sub-array of array2 which appears in the same order in array1.
So, if you were going to do it in the slowest way imaginable, say, you would say:
But, this feels like it would be quite inefficient. Can anyone see a better method? I'm using Ruby, but i'd be interested in a general algorithm as well as how to do it in that specific language.
Note that this isn't just array intersection because that (at least in ruby) doesn't care about the order of elements, which I do care about.
thanks!
Here's a fast working solution, cutting down the comparison to only those element in common across arrays:
array1 = ["hello", "world", "i", "am", "in", "the", "world"]
array2 = ["This", "is", "the", "hello", "world", "message"]
common_words = array1 & array2
stringified_array1 = array1.join(' ')
stringified_array2 = array2.join(' ')
(common_words.length - 1).downto(0).map do |n|
stringified_combo = array1[0..n].join(' ')
if stringified_array1.include?(stringified_combo) && stringified_array2.include?(stringified_combo)
stringified_combo.split($,)
end
end.compact.max
This you get the words in common between the two arrays, and tests these from largest to smallest. You check they're in order in the first array, then if they exist in the second.
I believe this holds up and is pretty efficient, though happy to receive any comments and feedback,
This seems to be a take on the "longest common substring" problem, but using words instead of characters in strings.
This wiki (https://en.wikipedia.org/wiki/Longest_common_substring_problem) outlines the dynamic programming approach to locate the largest common match in pseudo code and could be ported over to ruby passing the two arrays as the params.
function LCSubstr(S[1..r], T[1..n])
L := array(1..r, 1..n)
z := 0
ret := {}
for i := 1..r
for j := 1..n
if S[i] == T[j]
if i == 1 or j == 1
L[i,j] := 1
else
L[i,j] := L[i-1,j-1] + 1
if L[i,j] > z
z := L[i,j]
ret := {S[i-z+1..i]}
else
if L[i,j] == z
ret := ret ∪ {S[i-z+1..i]}
else
L[i,j] := 0
return ret
Do the "six word" test straight off Then I'd the go through each word in the second array and test to see if it's in the 1st. If it is then look for it and the one after, if both then look for the next.
ie once you've discovered "This" is not present in the first array, you've also discarded the five other potential candidates starting with this.
Until I learned about the "longest common substring/sequence problem" (see @Dustin's answer) I did not think there is an approach that is better than the one you outlined in the question: start with the largest possible subarray (array2
itself), then sequentially reduce the size of subarrays by one until a match is found (or it is determined that the two arrays contain no common element). Though I now see there is more efficient way, your idea is certainly not a bad one, particularly if the substrings are not too large, and it is easier to implement than the dynamic programming solution Dustin referenced. I've implemented your idea below.
I have chosen to use a regular expression to identify matches, so I need to convert array1
to a string.
str1 = array1.join(' ')
#=> "hello world i am in the world"
The calculation is as follows.
[array1.size, array2.size].min.downto(1).each do |n|
a = array2.each_cons(n).find { |a| str1.match?(/\b#{a.join(' ')}\b/) }
break a unless a.nil?
end
#=> ["hello", "world"]
nil
is returned if array1
and array2
have no common element. If desired one could first test (array1 & array2).empty?
.
Here is a possible improvement to what I have above. The idea is to attempt to reduce m
in m.downto(1)
.
h1 = array1.each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
#=> {"hello"=>1, "world"=>2, "i"=>1, "am"=>1, "in"=>1, "the"=>1}
h2 = array1.each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
#=> {"hello"=>1, "world"=>2, "i"=>1, "am"=>1, "in"=>1, "the"=>1}
m = [array1.size, array2.size, h2.sum { |k,v| [v, h2[k]].min }].min
#=> [7, 6, 7].min
#=> 6
This does not help here, but it might if the arrays array1
and array2
were different.
This was part of an implementation in Ruby of similar_text from PHP. Using strings:
def substrings(str)
(0...str.size).flat_map do |i|
(i...str.size).map { |j| str[i..j] }
end
end
def lcs(str1, str2)
(substrings(str1) & substrings(str2)).max_by(&:size)
end
puts "'#{lcs("hello world i am in the world", "This is the hello world message")}'"
=> 'hello world '
Brute force for the substrings might be a candidate for a Rust FFI call? We weren't doing very big compares so it worked for us.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With