I read Regular Expression Matching: the Virtual Machine Approach and now I try to parse a regular expression and create a virtual machine from it. The tokenizer works and creates its tokens. After that step, I create the reversed polish notation from the token stream so at the end I get
a b c | |
from the regular expression a|(b|c)
.
Well, now the step where I stuck: I want to get an array
0: split 1, 3
1: match 'a'
2: jump 7
3: split 4, 6
4: match 'b'
5: jump 7
6: match 'c'
7: noop
from the stream above. And I did not get it right... I use an output array and a stack for the start positions of each token. First, the 3 values are added to the output (and it's start positions to the stack).
output stack
------------------- ------
0: match 'a' 0: 0
1: match 'b' 1: 1
2: match 'c' 2: 2
With |
, I pop the last 2 positions from the stack and insert split
and jump
at the specific positions. The values are calculated based on the current stack length and the amount of elements I add.
At the end, I add the new start-position of the last element to the stack (remains the same in this case).
output stack
------------------- ------
0: match 'a' 0: 0
1: split 2, 4 1: 1
2: match 'b'
3: jump 5
4: match 'c'
That seems ok. Now, the next |
is popped...
output stack
------------------- ------
0: split 1, 3 0: 0
1: match 'a'
2: jump 7
3: split 2, 4
4: match 'b'
5: jump 5
6: match 'c'
And here's the problem. I have to update all the addresses that I calculated before (lines 3 and 5). That's not what I want to. I guess, relative addresses have the same problem (at least if the values are negative).
So my question is, how to create a vm from regex. Am I on the right track (with the rpn-form) or is there another (and/or easier) way?
The output array is stored as an integer array. The split
-command needs in fact 3 entries, jump
needs two, ...
\v represents the "vertical tab character", or ASCII 11.
The () will allow you to read exactly which characters were matched. Parenthesis are also useful for OR'ing two expressions with the bar | character. For example, (a-z|0-9) will match one character -- any of the lowercase alpha or digit.
The Difference Between \s and \s+ For example, expression X+ matches one or more X characters. Therefore, the regular expression \s matches a single whitespace character, while \s+ will match one or more whitespace characters.
Regular expressions are useful in any scenario that benefits from full or partial pattern matches on strings. These are some common use cases: verify the structure of strings. extract substrings form structured strings.
It would be easier to use relative jumps and splits instead.
a
— Push a match
to the stack
0: match 'a'
b
— Push a match
to the stack
0: match 'a'
--
0: match 'b'
c
— Push a match
to the stack
0: match 'a'
--
0: match 'b'
--
0: match 'c'
|
— Pop two frames from the stack, and instead push split <frame1> jump <frame2>
0: match 'a'
--
0: split +1, +3
1: match 'b'
2: jump +2
3: match 'c'
|
— Pop two frames from the stack, and instead push split <frame1> jump <frame2>
0: split +1, +3
1: match 'a'
2: jump +5
3: split +1, +3
4: match 'b'
5: jump +2
6: match 'c'
If you really need absolute jumps instead, you could easily iterate through and adjust all offsets.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With