Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Virtual machine from regular expression

I read Regular Expression Matching: the Virtual Machine Approach and now I try to parse a regular expression and create a virtual machine from it. The tokenizer works and creates its tokens. After that step, I create the reversed polish notation from the token stream so at the end I get

a b c | |

from the regular expression a|(b|c). Well, now the step where I stuck: I want to get an array

0: split 1, 3
1: match 'a'
2: jump 7
3: split 4, 6
4: match 'b'
5: jump 7
6: match 'c'
7: noop

from the stream above. And I did not get it right... I use an output array and a stack for the start positions of each token. First, the 3 values are added to the output (and it's start positions to the stack).

output              stack
------------------- ------
0: match 'a'        0: 0
1: match 'b'        1: 1
2: match 'c'        2: 2

With |, I pop the last 2 positions from the stack and insert split and jump at the specific positions. The values are calculated based on the current stack length and the amount of elements I add. At the end, I add the new start-position of the last element to the stack (remains the same in this case).

output              stack
------------------- ------
0: match 'a'        0: 0
1: split 2, 4       1: 1
2: match 'b'
3: jump 5
4: match 'c'

That seems ok. Now, the next | is popped...

output              stack
------------------- ------
0: split 1, 3       0: 0
1: match 'a'
2: jump 7
3: split 2, 4
4: match 'b'
5: jump 5
6: match 'c'

And here's the problem. I have to update all the addresses that I calculated before (lines 3 and 5). That's not what I want to. I guess, relative addresses have the same problem (at least if the values are negative).

So my question is, how to create a vm from regex. Am I on the right track (with the rpn-form) or is there another (and/or easier) way?

The output array is stored as an integer array. The split-command needs in fact 3 entries, jump needs two, ...

like image 954
mal-raten Avatar asked May 22 '15 13:05

mal-raten


People also ask

What is V in regular expression?

\v represents the "vertical tab character", or ASCII 11.

What is () in regular expression?

The () will allow you to read exactly which characters were matched. Parenthesis are also useful for OR'ing two expressions with the bar | character. For example, (a-z|0-9) will match one character -- any of the lowercase alpha or digit.

What is the use of (? S in regular expression?

The Difference Between \s and \s+ For example, expression X+ matches one or more X characters. Therefore, the regular expression \s matches a single whitespace character, while \s+ will match one or more whitespace characters.

Which are 3 uses of regular expression?

Regular expressions are useful in any scenario that benefits from full or partial pattern matches on strings. These are some common use cases: verify the structure of strings. extract substrings form structured strings.


1 Answers

It would be easier to use relative jumps and splits instead.

  • a — Push a match to the stack

    0: match 'a'
    
  • b — Push a match to the stack

    0: match 'a'
    --
    0: match 'b'
    
  • c — Push a match to the stack

    0: match 'a'
    --
    0: match 'b'
    --
    0: match 'c'
    
  • | — Pop two frames from the stack, and instead push split <frame1> jump <frame2>

    0: match 'a'
    --
    0: split +1, +3
    1: match 'b'
    2: jump +2
    3: match 'c'
    
  • | — Pop two frames from the stack, and instead push split <frame1> jump <frame2>

    0: split +1, +3
    1: match 'a'
    2: jump +5
    3: split +1, +3
    4: match 'b'
    5: jump +2
    6: match 'c'
    

If you really need absolute jumps instead, you could easily iterate through and adjust all offsets.

like image 83
Markus Jarderot Avatar answered Sep 20 '22 01:09

Markus Jarderot