
Virtual machine from regular expression

Tags:

regex

vm-implementation

I read "Regular Expression Matching: the Virtual Machine Approach" and am now trying to parse a regular expression and compile it into a virtual machine program. The tokenizer works and produces its tokens. After that step, I convert the token stream to reverse Polish notation, so at the end I get

a b c | |

from the regular expression a|(b|c). Now for the step where I am stuck: I want to get the array

0: split 1, 3
1: match 'a'
2: jump 7
3: split 4, 6
4: match 'b'
5: jump 7
6: match 'c'
7: noop

from the stream above, and I can't get it right. I use an output array and a stack holding the start position of each token. First, the three values are added to the output (and their start positions to the stack).

output              stack
------------------- ------
0: match 'a'        0: 0
1: match 'b'        1: 1
2: match 'c'        2: 2

On |, I pop the last two positions from the stack and insert split and jump at those positions. The target addresses are calculated from the current stack length and the number of elements I add. Finally, I push the new start position of the last element onto the stack (it remains the same in this case).

output              stack
------------------- ------
0: match 'a'        0: 0
1: split 2, 4       1: 1
2: match 'b'
3: jump 5
4: match 'c'

That seems OK. Now the next | is handled:

output              stack
------------------- ------
0: split 1, 3       0: 0
1: match 'a'
2: jump 7
3: split 2, 4
4: match 'b'
5: jump 5
6: match 'c'

And here's the problem: I have to update all the addresses that I calculated before (lines 3 and 5), which is not what I want to do. I guess relative addresses have the same problem (at least if the offsets are negative).

So my question is: how do I create a VM program from a regex? Am I on the right track (with the RPN form), or is there another (and/or easier) way?

The output array is stored as an integer array. The split command in fact needs three entries, jump needs two, ...

asked May 22 '15 13:05

mal-raten

1 Answer

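It would be easier to use relative jumps and splits instead. With relative offsets, an already compiled fragment can be embedded in a larger one without patching its addresses. Keep a stack of such fragments (frames) rather than start positions; in the listings below, -- separates the frames on the stack.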

a: Push a match onto the stack
```
0: match 'a'
```

b: Push a match onto the stack
```
0: match 'a'
--
0: match 'b'
```

c: Push a match onto the stack
```
0: match 'a'
--
0: match 'b'
--
0: match 'c'
```

|: Pop two frames from the stack, and instead push split <frame1> jump <frame2>
```
0: match 'a'
--
0: split +1, +3
1: match 'b'
2: jump +2
3: match 'c'
```

|: Pop two frames from the stack, and instead push split <frame1> jump <frame2>
```
0: split +1, +3
1: match 'a'
2: jump +5
3: split +1, +3
4: match 'b'
5: jump +2
6: match 'c'
```
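The steps above could be sketched in Python roughly as follows. This is only an illustration, not the asker's code: the function name `compile_rpn` and the tuple representation of instructions are assumptions made for the example, and only literals and `|` are handled.

```python
def compile_rpn(tokens):
    """Compile an RPN token list into a program with relative offsets.

    Hypothetical helper: instructions are tuples such as
    ("match", "a"), ("split", +1, +3) and ("jump", +2).
    """
    stack = []  # each entry is a frame, i.e. a list of instructions
    for tok in tokens:
        if tok == "|":
            right = stack.pop()  # frame2
            left = stack.pop()   # frame1
            frame = (
                # first branch starts right after the split,
                # second branch starts after frame1 and its jump
                [("split", 1, len(left) + 2)]
                + left
                # skip over frame2 to the instruction after it
                + [("jump", len(right) + 1)]
                + right
            )
            stack.append(frame)
        else:
            stack.append([("match", tok)])
    (program,) = stack  # exactly one frame must remain
    return program


program = compile_rpn(["a", "b", "c", "|", "|"])
for pc, instr in enumerate(program):
    print(pc, instr)
```

Because every offset is relative to the instruction that holds it, wrapping a frame in a new `split`/`jump` pair never invalidates the offsets already inside it.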

If you really need absolute jumps instead, you could easily iterate through and adjust all offsets.
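A minimal sketch of that adjustment pass, assuming instructions are stored as tuples with relative offsets (the name `to_absolute` and the tuple layout are made up for this example). It adds each instruction's own index to its offsets and appends a `noop` so that the final jump targets land on a real instruction, as in the listing in the question:

```python
def to_absolute(program):
    """Turn relative split/jump offsets into absolute addresses."""
    absolute = []
    for pc, instr in enumerate(program):
        op = instr[0]
        if op in ("split", "jump"):
            # target = own position + relative offset
            absolute.append((op,) + tuple(pc + off for off in instr[1:]))
        else:
            absolute.append(instr)
    absolute.append(("noop",))  # landing pad for jumps past the end
    return absolute


# The relative program from the last step above.
relative = [
    ("split", 1, 3),
    ("match", "a"),
    ("jump", 5),
    ("split", 1, 3),
    ("match", "b"),
    ("jump", 2),
    ("match", "c"),
]

for pc, instr in enumerate(to_absolute(relative)):
    print(pc, instr)
```

If the program is stored as a flat integer array where instructions take different numbers of entries, the same idea should apply, but offsets would have to be counted in array entries rather than in instructions.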

answered Sep 20 '22 01:09

Markus Jarderot
