Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Ruby Regex, get all possible matches (no clipping of the string)

Tags:

regex

ruby

I have run into a problem with the ruby regex. I need to find all (potentially overlapping) matches. This is a simplification of the problem:

#Simple example
"Hey".scan(/../)
=> ["He"] 
#Actual results

#With overlapping matches the result should be
=> ["He"], ["ey"]

The regex I am trying to execute and get all results for looks like this:

"aaaaaa".scan(/^(..+)\1+$/) #This looks for multiples of (here) "a" bigger than one that "fills" the entire string. "aa"*3 => true, "aaa"*2 => true. "aaaa"*1,5 => false.
 => [["aaa"]] 

#With overlapping results this should be
 => [["aa"],["aaa"]]

Is there a library or a way to do regex in ruby to get the results I am after?

I found some clues that this was possible in Perl, but after hours of research I did not find anything about a Ruby way of doing this.

However I was able to find this "Javascript Regex - Find all possible matches, even in already captured matches", but I am not able to find anything similar for Ruby, nor find something similar to the last index property in the Ruby version. To be honest I don't think that it would have worked anyways since the regex I intend to use is recursive and relies on the entire string, while that method chops away at the string.

like image 631
Automatico Avatar asked May 04 '13 23:05

Automatico


3 Answers

Kind of old topic... Not sure if I understand, but best I can find is this:

"Hey".scan(/(?=(..))/)
 => [["He"], ["ey"]] 

"aaaaaa".scan(/(?=(..+)\1)/)
 => [["aaa"], ["aa"], ["aa"]] 

scan walks thru every byte and the "positive look-ahead" (?=) tests the regexp (..+)\1 in every step. Look-aheads don't consume bytes, but the capture group inside it returns the match if it exists.

like image 80
Fernando Fabreti Avatar answered Oct 26 '22 15:10

Fernando Fabreti


Are you just missing the second capture group?

"aaaaaa".scan(/(..+?)(\1+)/)
#=> [["aa", "aaaa"]]

It seems like there might be something wrong with your expectation.

like image 22
pguardiario Avatar answered Oct 26 '22 14:10

pguardiario


The problem with any solution based on scan is it won't find overlapping matches as scan always makes forward progress. It might be possible to recast the regexp so it's entirely embedded in a zero-width positive lookahead and then use scan, but IIRC there are otherwise valid regexp patterns that do not work in lookahead or lookbehind.

There's some ambiguity in the question as posed. This interprets the question as really asking to find all the unique matching substrings of a target string for which a regexp will match. Though not strictly necessary it uses ruby 2.0 lazy evaluation to avoid excessive intermediate array allocations.

class String
  def each_substring
    Enumerator.new do |y|
      (0...length).each do |b|
        (b...length).each do |e|
          y << self[b..e]
        end
      end
      y << '' 
    end
  end
end

class Regexp
  def all_possible_matches(str)
    str.each_substring.lazy.
    map { |s| match(s) }.
    reject(&:nil?).
    map { |m| m.size > 1 ? m[1..-1] : m[0] }.
    to_a.uniq
  end
end

/.{2,4}/.all_possible_matches('abcde')
=> ["ab", "abc", "abcd", "bc", "bcd", "bcde", "cd", "cde", "de"]

/^(..+?)\1+$/.all_possible_matches('aaaaaa')
=> [["aa"]]
/^(..+)\1+$/.all_possible_matches('aaaaaa')
=> [["aa"], ["aaa"]]
/^(..+?)\1+$/.all_possible_matches('aaaaaaaaa')
=> [["aa"], ["aaa"]]
/^(..+)\1+$/.all_possible_matches('aaaaaaaaa')
=> [["aa"], ["aaa"], ["aaaa"]]

EDIT: made it return capture groups when present. The OP's desired solution to the non-greedy form of /^(..+?)\1+$/ is wrong as the ? means it will be satisfied with the pattern with the fewest chars.

like image 27
dbenhur Avatar answered Oct 26 '22 16:10

dbenhur