I have run into a problem with the ruby regex. I need to find all (potentially overlapping) matches. This is a simplification of the problem:
#Simple example
"Hey".scan(/../)
=> ["He"]
#Actual results
#With overlapping matches the result should be
=> ["He"], ["ey"]
The regex I am trying to execute and get all results for looks like this:
"aaaaaa".scan(/^(..+)\1+$/) #This looks for multiples of (here) "a" bigger than one that "fills" the entire string. "aa"*3 => true, "aaa"*2 => true. "aaaa"*1,5 => false.
=> [["aaa"]]
#With overlapping results this should be
=> [["aa"],["aaa"]]
Is there a library or a way to do regex in ruby to get the results I am after?
I found some clues that this was possible in Perl, but after hours of research I did not find anything about a Ruby way of doing this.
However I was able to find this "Javascript Regex - Find all possible matches, even in already captured matches", but I am not able to find anything similar for Ruby, nor find something similar to the last index property in the Ruby version. To be honest I don't think that it would have worked anyways since the regex I intend to use is recursive and relies on the entire string, while that method chops away at the string.
Kind of old topic... Not sure if I understand, but best I can find is this:
"Hey".scan(/(?=(..))/)
=> [["He"], ["ey"]]
"aaaaaa".scan(/(?=(..+)\1)/)
=> [["aaa"], ["aa"], ["aa"]]
scan walks thru every byte and the "positive look-ahead" (?=)
tests the regexp (..+)\1
in every step. Look-aheads don't consume bytes, but the capture group inside it returns the match if it exists.
Are you just missing the second capture group?
"aaaaaa".scan(/(..+?)(\1+)/)
#=> [["aa", "aaaa"]]
It seems like there might be something wrong with your expectation.
The problem with any solution based on scan
is it won't find overlapping matches as scan
always makes forward progress. It might be possible to recast the regexp so it's entirely embedded in a zero-width positive lookahead and then use scan
, but IIRC there are otherwise valid regexp patterns that do not work in lookahead or lookbehind.
There's some ambiguity in the question as posed. This interprets the question as really asking to find all the unique matching substrings of a target string for which a regexp will match. Though not strictly necessary it uses ruby 2.0 lazy evaluation to avoid excessive intermediate array allocations.
class String
def each_substring
Enumerator.new do |y|
(0...length).each do |b|
(b...length).each do |e|
y << self[b..e]
end
end
y << ''
end
end
end
class Regexp
def all_possible_matches(str)
str.each_substring.lazy.
map { |s| match(s) }.
reject(&:nil?).
map { |m| m.size > 1 ? m[1..-1] : m[0] }.
to_a.uniq
end
end
/.{2,4}/.all_possible_matches('abcde')
=> ["ab", "abc", "abcd", "bc", "bcd", "bcde", "cd", "cde", "de"]
/^(..+?)\1+$/.all_possible_matches('aaaaaa')
=> [["aa"]]
/^(..+)\1+$/.all_possible_matches('aaaaaa')
=> [["aa"], ["aaa"]]
/^(..+?)\1+$/.all_possible_matches('aaaaaaaaa')
=> [["aa"], ["aaa"]]
/^(..+)\1+$/.all_possible_matches('aaaaaaaaa')
=> [["aa"], ["aaa"], ["aaaa"]]
EDIT: made it return capture groups when present. The OP's desired solution to the non-greedy form of /^(..+?)\1+$/
is wrong as the ?
means it will be satisfied with the pattern with the fewest chars.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With