Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Short example of regular expression converted to a state machine?

Tags:

In the Stack Overflow podcast #36 (https://blog.stackoverflow.com/2009/01/podcast-36/), this opinion was expressed: Once you understand how easy it is to set up a state machine, you’ll never try to use a regular expression inappropriately ever again.

I've done a bunch of searching. I've found some academic papers and other complicated examples, but I'd like to find a simple example that would help me understand this process. I use a lot of regular expressions, and I'd like to make sure I never use one "inappropriately" ever again.

like image 241
slothbear Avatar asked Feb 08 '09 02:02

slothbear


People also ask

What is an example of a state machine?

There are many more examples of finite state machines we could use: a vending machine. a subway entrance turnstile. a heating system.

What is regular expression for the final state machine?

Regular expressions describe patterns which can be recognized by finite state machines (FSM). It is possible to algorithmically construct a FSM that corresponds to a given regular expression. A FSM can be described by a transition table (program), which can be represented by a string.


2 Answers

Sure, although you'll need more complicated examples to truly understand how REs work. Consider the following RE:

^[A-Za-z][A-Za-z0-9_]*$ 

which is a typical identifier (must start with alpha and can have any number of alphanumeric and undescore characters following, including none). The following pseudo-code shows how this can be done with a finite state machine:

state = FIRSTCHAR for char in all_chars_in(string):     if state == FIRSTCHAR:             if char is not in the set "A-Z" or "a-z":                 error "Invalid first character"             state = SUBSEQUENTCHARS             next char     if state == SUBSEQUENTCHARS:             if char is not in the set "A-Z" or "a-z" or "0-9" or "_":                 error "Invalid subsequent character"             state = SUBSEQUENTCHARS             next char 

Now, as I said, this is a very simple example. It doesn't show how to do greedy/nongreedy matches, backtracking, matching within a line (instead of the whole line) and other more esoteric features of state machines that are easily handled by the RE syntax.

That's why REs are so powerful. The actual finite state machine code required to do what a one-liner RE can do is usually very long and complex.

The best thing you could do is grab a copy of some lex/yacc (or equivalent) code for a specific simple language and see the code it generates. It's not pretty (it doesn't have to be since it's not supposed to be read by humans, they're supposed to be looking at the lex/yacc code) but it may give you a better idea as to how they work.

like image 152
paxdiablo Avatar answered Oct 28 '22 15:10

paxdiablo


A rather convenient way to help look at this to use python's little-known re.DEBUG flag on any pattern:

>>> re.compile(r'<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>', re.DEBUG) literal 60 subpattern 1   in     range (65, 90)   max_repeat 0 65535     in       range (65, 90)       range (48, 57) at at_boundary max_repeat 0 65535   not_literal 62 literal 62 subpattern 2   min_repeat 0 65535     any None literal 60 literal 47 groupref 1 literal 62 

The numbers after 'literal' and 'range' refer to the integer values of the ascii characters they're supposed to match.

like image 33
ʞɔıu Avatar answered Oct 28 '22 14:10

ʞɔıu