Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it possible to match multiple heredoc expressions with regexes?

Tags:

regex

Is there any regex engine which would allow me to match multiple heredoc strings on an expression? E.g., as one would write in Ruby:

f <<FOO, 10, <<BAR, 20
some text
FOO
some more text
BAR

I tought of using backrefs and recursive calling in Perl's flavor, but couldn't manage to make the cross-serial dependencies work (i.e., couldn't reverse the captured backrefs, as FOO should match before BAR). I also thought of balancing groups on .Net, where I can reverse the stack by using lookaheads (I know, this is a terrible hack), like this:

(?:(?<x>foo|bar|baz)|\s)+(?(x)|(?!))\s*(?(x)(?=(.*?)(?<-x>(?<y>\k<x>)))){3}(?(x)(?!))(?:(?(y)(?<-y>\k<y>))|\s)+(?(x)(?!))(?(y)(?!))

(Click here to test it.)

That matches foo bar baz foo bar baz, but then I have to add a manual counter (the {3}), since the lookahead won't repeat with + since it doesn't consume any input I assume. Thus this won't work on arbitrary cases (but it was close!). I could, of course, replace that by {1000} or any other big number, and that would answer my question, but I wonder if there are other ways.

Acknowledgment: I do understand it is not a good idea to match such kind of construct with regexes. I am doing research work on such, and I want to find out if it is possible. If it is, please do not use it in production code.

like image 432
paulotorrens Avatar asked Jul 31 '16 20:07

paulotorrens


1 Answers

Since your primary question is "is it possible with regex", I would like to start by sharing my favorite regex information site. Specifically, How does the regex engine work? Learning that will give you a better idea of how regex works, and why trying to step out of it's very well defined box quickly devolves into broken souls and CPUs.

The key takeaway though is that at any point, the regex engine only has 2 pieces of information.

  1. What token am I trying to match?
  2. What is the next token in the string?

This is easy to forget because unlike steam parsing, the regex engine can backtrack when it fails a match (which, it usually is doing a lot of). And CPUs are fast enough that it can fail millions of times per second! While Regex may appear to have more memory because it can match "The first cat after dog", it only knows it has seen the word dog, because it is currently looking for the c in cat. Or in other words, the current state is only possible because a specific precondition(s) was met.

With a finite number of permutations of a pattern, a long enough regex can match anything. (The length of that regex may be soul/CPU crushing, but technically possible)

Where the pattern is not finite, like "match a number of a followed by an equal number of bs" (ex "ab" "aabb" "aaabbb" ect.) Regex has no mechanism to remember how many a's it has seen, so it doesn't know how many b's to match. You can work around this by trying to match all the variations (ab|aabb|aaabbb|aaaabbbb|...), but this would be insanely expensive to parse, and you would fail to capture every valid input because I can always add one more ab pair.

So really, you need to ask yourself 2 questions

  1. Is there a finite number of permutations I care to match?
  2. Will my soul/CPU survive such regex?

Related, you should probably read up on Nondeterministic finite automaton. Since you are asking for research reasons, and any pure regex engine is a NDFA, it helps to know they suffer from the same known limitations.

TL:DR;

Using a pure regex engine...

Practically, Yes, but at the cost of a soul.
Theoretically, No, not at all. There will always be a valid case your regex will fail.

like image 133
Tezra Avatar answered Oct 06 '22 20:10

Tezra