In the Stack Overflow podcast #36 (https://blog.stackoverflow.com/2009/01/podcast-36/), this opinion was expressed: Once you understand how easy it is to set up a state machine, you’ll never try to use a regular expression inappropriately ever again.
I've done a bunch of searching. I've found some academic papers and other complicated examples, but I'd like to find a simple example that would help me understand this process. I use a lot of regular expressions, and I'd like to make sure I never use one "inappropriately" ever again.
There are many more examples of finite state machines we could use: a vending machine. a subway entrance turnstile. a heating system.
Regular expressions describe patterns which can be recognized by finite state machines (FSM). It is possible to algorithmically construct a FSM that corresponds to a given regular expression. A FSM can be described by a transition table (program), which can be represented by a string.
Sure, although you'll need more complicated examples to truly understand how REs work. Consider the following RE:
^[A-Za-z][A-Za-z0-9_]*$
which is a typical identifier (must start with alpha and can have any number of alphanumeric and undescore characters following, including none). The following pseudo-code shows how this can be done with a finite state machine:
state = FIRSTCHAR for char in all_chars_in(string): if state == FIRSTCHAR: if char is not in the set "A-Z" or "a-z": error "Invalid first character" state = SUBSEQUENTCHARS next char if state == SUBSEQUENTCHARS: if char is not in the set "A-Z" or "a-z" or "0-9" or "_": error "Invalid subsequent character" state = SUBSEQUENTCHARS next char
Now, as I said, this is a very simple example. It doesn't show how to do greedy/nongreedy matches, backtracking, matching within a line (instead of the whole line) and other more esoteric features of state machines that are easily handled by the RE syntax.
That's why REs are so powerful. The actual finite state machine code required to do what a one-liner RE can do is usually very long and complex.
The best thing you could do is grab a copy of some lex/yacc (or equivalent) code for a specific simple language and see the code it generates. It's not pretty (it doesn't have to be since it's not supposed to be read by humans, they're supposed to be looking at the lex/yacc code) but it may give you a better idea as to how they work.
A rather convenient way to help look at this to use python's little-known re.DEBUG flag on any pattern:
>>> re.compile(r'<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>', re.DEBUG) literal 60 subpattern 1 in range (65, 90) max_repeat 0 65535 in range (65, 90) range (48, 57) at at_boundary max_repeat 0 65535 not_literal 62 literal 62 subpattern 2 min_repeat 0 65535 any None literal 60 literal 47 groupref 1 literal 62
The numbers after 'literal' and 'range' refer to the integer values of the ascii characters they're supposed to match.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With