Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does Perl's regex implementation makes use of tries?

I'm exploring ways to optimize a stack-based backtracking regex implementation (see also this thread). In the comments a hint about Perl's regex matcher was given. From the source code I already figured that it makes use of tries when matching a regex with several alternatives.

For example, a regex like /(?:foo|bar|baz|foobar)$/ would normally yield a program like

BRANCH
  EXACT <foo>
BRANCH
  EXACT <bar>
BRANCH
  EXACT <baz>
BRANCH
  EXACT <foobar>
EOL

while the optimized version using a trie looks somewhat like

TRIE-EXACT
  <foo>
  <bar>
  <baz>
  <foobar>
EOL

The four branches with an EXACT command each were optimized into a single TRIE command.

However, I don't understand how this trie is used in the execution phase with respect to backtracking.

Consider the non-optimized code and the input string foobar. When trying to match the regex, the first branch foo would be successfully matched, the $ would fail, the next branches bar and baz would fail, the last branch foobar would match, $ would match, and the matching completes successfully.

Now how does the matching work with the trie? From my understanding a trie has only one sensible implied order in which overlapping entries are tried, namely the longest match first. This would give a correct match in the aforementioned example.

But it would give the wrong result for /(?:foo|foobar)barr$/ and the input foobarr. The trie would match foobar, but fail on the following r because it expects a b next. So there must not only be a way to backtrack in the trie, but also the original order of branches must be preserved somehow, otherwise the optimized program would be semantically different from the non-optimized version.

How does the Perl implementation handle this tree-based matching?

like image 545
siracusa Avatar asked Aug 10 '19 05:08

siracusa


2 Answers

I wrote the original implementation of this code. It has changed somewhat since then, which has slightly changed the time/space tradeoffs for how it works.

First, I need to call out one misunderstanding in your post, the rules for Perls regex engine are not to match the longest string, but rather to match the leftmost-longest string, in particular

"foal"=~/(f|fo|foa|foal)/ 

should match "f" not "foal" as the "f" option is first. This is called the "word order" in the comments in the code. You can see this with Perl by injecting (?{ print $& }) into your patterns:

perl -le'my $str="foal"; $str=~/(?:f|foa|fo)(?{print $&})al/'
f
foa
fo 

Another point I thought I should explain is that the code and comments use the term "state" and "accepting state" a lot. This comes from the observation that a Trie is equivalent to an acyclic DFA, and can be naively represented as a table of one row per state, with one column per legal character in the input alphabet and with an additional column being used to track whether the state is "accepting", meaning "ends one of the alternatives", and if so which branch of the alternation it terminates. The values in the other columns represent the new state to transition to after reading the columns character. An accepting state may have transitions to more states, as in the case where one branch matches a prefix of another.

Matching then comes down to walking the tree/state table, when we encounter an accepting state we push a 2-tuple of the number of characters we have read so far in the trie and the word number associated with the state into a list. Eventually we either run out of string to read, or encounter a transition to the 0 state which means we can stop as no further matches are possible.

Once you have a list of the tuples of (length,word) you can then go through them in order of word ascending, since such lists are generally short IMO it is reasonable to simply use linear probing to find the minimum word each try. Perl uses a strategy where it precompiles all such possible lists into one structure in advance, so it need only keep track of the branch number of the most recent accepting state, and then can (somewhat expensively) compute any preceeding accepting states from that. For instance the example below /(f|foa|fo)al/ produces a structure like this:

word_info N:(prev,len)= 1:(0,1) 2:(3,3) 3:(1,2)

Which shows that if we reach the accepting state for "foa" which is branch 2 we have matched a 3 character string and there is a preceding accepting state of branch 3, which in turn points at 1. From this we can reconstruct the behavior we would expect to see, which is to try "al" after "f", "foa" and then "fo". The process is somewhat expensive (N**2) compared to other solutions, but has the advantage of being precompiled and reusable and the storage requirements fit nicely with how the regex backtracking engine works. IMO it doesnt matter, the main issue is that somehow you track the accepting states where you were in the string when you encountered them, and then try the "tail" in the right order.

You can see much of this at work with extended debugging:

perl -Mre=Debug,TRIE,TRIEE,TRIEM -le'my $str="foal"; $str=~/(?:f|foa|fo)(?{print $&})al/' 
Compiling REx "(?:f|foa|fo)(?{print $&})al"
  Looking for TRIE'able sequences. Tail node is: EVAL
  - BRANCH (1) -> EXACT <f> => EVAL (First==-1,Last==-1,Cur==1,tt==END,nt==EXACT,nnt==END)
  - BRANCH (4) -> EXACT <foa>   => EVAL (First==1,Last==-1,Cur==4,tt==EXACT,nt==EXACT,nnt==END)
  - BRANCH (7) -> EXACT <fo>    => EVAL (First==1,Last==4,Cur==7,tt==EXACT,nt==EXACT,nnt==END)
  - TAIL (10) <SCAN FINISHED>
    make_trie start==1, first==1, last==10, tail==11 depth=1
    TRIE(NATIVE): W:3 C:6 Uq:3 Min:1 Max:3
    Compiling trie using table compiler
      Char :    f   o   a
      State+-------------
         1 :    2   .   . (   1)
         2 :    .   3   . (   1) W   1
         3 :    .   .   4 (   1) W   3
         4 :    .   .   . (   0) W   2

Here you can see the DFA/TRIE equivalency and the word data, and how the table is constructed. Because there are only 3 legal characters we build a table that has 4 columns, one per each legal character, and one to track if the state is accepting and if so which branch of the alternation it accepts. The W 2 on row 4 shows that it is accepting for the 2nd branch of the alternation.

    Alloc: 22 Orig: 13 elements, Final:3. Savings of %76.92
    Statecount:5 Lasttrans:4
      Char : Match Base  Ofs     f   o   a
      State|-----------------------------------
      #   1|       @   3 + 0[    2   .   .]
      #   2| W   1 @   3 + 1[    .   3   .]
      #   3| W   3 @   3 + 2[    .   .   4]
      #   4| W   2 @   0 

This shows the naive table representation being packed into a compressed form that allows the zero transition in one row to be used to store non-zero transitions in other rows. Each transition is stored in an array of 2-tuple of [owner-state,transition], with two side arrays, one which stores the start position of a given state, and another that stores the accepting mappings for a given state, state transitions involve checking if table[start_offset[state]+column_ofs].owner_state is the same as state, and if not then table[start_offset[state]+column_ofs].new_state can be assumed to be 0. See the Red Dragon for more details.


    word_info N:(prev,len)= 1:(0,1) 2:(3,3) 3:(1,2)

this part of the output shows how we have precomputed the length, and the predecessor branch number for every accepting state. This way we don't have to construct a list of accepting states as we match, but instead can precompute all the possible such lists at compile time into one static structure.

Final program:
   1: EXACT <f> (3)
   3: TRIE-EXACT<S:2/4 W:3 L:0/2 C:6/3>[o] (11)
      <> 
      <oa> 
      <o> 

Here you can see an interesting optimization kick in, because all of the alternations start with the letter "f" we "remove" it from the trie, and turn it into a prefix, the EXACT , followed by a trie containing the empty string, the letters "oa" and the letter "o". The S:2/4 means we start on state 2 not on the normal state 1, which is the state that corresponds to the letter 'f'. Extracting the 'f' like this allows it to be used elsewhere in the matching process, which means we can use less expensive machinery the Trie to find possible matches.

  11: EVAL (13)
  13: EXACT <al> (15)
  15: END (0)
anchored "f" at 0 floating "al" at 1..3 (checking floating) minlen 3 with eval 
Guessing start of match in sv for REx "(?:f|foa|fo)(?{print $&})al" against "foal"
Found floating substr "al" at offset 2...
Found anchored substr "f" at offset 0...

The 'f' that was extracted earlier is being used to find the first possible place in the string the pattern could match.

Guessed: match at offset 0
Matching REx "(?:f|foa|fo)(?{print $&})al" against "foal"
   0 <> <foal>               |  1:EXACT <f>(3)
   1 <f> <oal>               |  3:TRIE-EXACT<S:2/4 W:3 L:0/2 C:6/3>[o](11)
   1 <f> <oal>               |    State:    2 Accepted: Y Charid:  2 CP:  6f After State:    3
   2 <fo> <al>               |    State:    3 Accepted: Y Charid:  3 CP:  61 After State:    4
   3 <foa> <l>               |    State:    4 Accepted: Y Charid:  0 CP:   0 After State:    0

Here we can see the walk through tree starting at state 2. Each "Accepted" line means we have found a possible match. The last line shows us reading a character not in our alphabet, meaning a transition to state 0 which means we can stop. Unfortunately this does not show the word numbers associated with each match, but you can infer them from the state transition table: 1,3,2, but we need to "try" them 1,2,3.

                                  got 3 possible matches
                                  TRIE matched word #1, continuing
   1 <f> <oal>               | 11:  EVAL(13)
f
   1 <f> <oal>               | 13:  EXACT <al>(15)
                                    failed...
                                  TRIE matched word #2, continuing
   3 <foa> <l>               | 11:  EVAL(13)
foa
   3 <foa> <l>               | 13:  EXACT <al>(15)
                                    failed...
                                  TRIE matched word #3, continuing
                                  only one match left, short-circuiting: #3 <o>
   2 <fo> <al>               | 11:EVAL(13)
fo
   2 <fo> <al>               | 13:EXACT <al>(15)
   4 <foal> <>               | 15:END(0)
Match successful!

Note an explanation of different aspects of the Trie compilation process can be found at:

https://perl5.git.perl.org/perl.git/blob/HEAD:/regcomp.c#l2407

https://perl5.git.perl.org/perl.git/blob/HEAD:/regcomp.c#l4726

like image 140
demerphq Avatar answered Sep 24 '22 13:09

demerphq


You can get some information by using the re pragma to show how perl compiles and uses the regular expression. For example:

$ perl -Mre=debugcolor -e '"foobarr" =~ /(?:foobar|foo)barr$/' 

gives output:

enter image description here

So it is clear that it compiles the regex into three operations:

  • An EXACT operation which matches foo (this is the common prefix of the alternation (?:foobar|foo), followed by
  • a TRIE-EXACT operation matching bar or the empty string, and then
  • an EXACT operation matching barr.

Then the compiled regex is used in the match:

  • First the EXACT match of foo matches, then
  • the TRIE-EXACT tries the first alternative bar, which matches, then
  • the EXACT operation fails on matching r with barr,
  • the engine backtracks and tries the second word in the TRIE-EXACT operation, which matches the empty string,
  • and finally the last EXACT operation matches barr.

Currently this does not answer your question on how exactly the trie-based matching is implemented, but I plan to investigate further! See also perldoc perlreguts for more information.

Edit: More information on how the trie matching is implemented can be found in regexec.c line #5931:

[...] the basic plan of execution of the trie is: At the beginning, run through all the states, and find the longest-matching word. Also remember the position of the shortest matching word. For example, this pattern:

  1  2 3 4    5
  ab|a|x|abcd|abc

when matched against the string "abcde", will generate accept states for all words except 3, with the longest matching word being 4, and the shortest being 2 (with the position being after char 1 of the string).

Then for each matching word, in word order (i.e. 1,2,4,5), we run the remainder of the pattern; on each try setting the current position to the character following the word, returning to try the next word on failure.

This should explain how the TRIE based matching manages to do the alternations in word-order.

like image 39
Håkon Hægland Avatar answered Sep 21 '22 13:09

Håkon Hægland