In the Stack Overflow podcast #36 (https://blog.stackoverflow.com/2009/01/podcast-36/), this opinion was expressed: Once you understand how easy it is to set up a state machine, you’ll never try to use a regular expression inappropriately ever again. I've done a bunch of searching. I've found some academic papers and other complicated examples, but I'd like to find a simple example that would help me understand this process. I use a lot of regular expressions, and I'd like to make sure I never use one "inappropriately" ever again.

Sure, although you'll need more complicated examples to truly understand how REs work. Consider the following RE: <pre class="prettyprint"><code>^[A-Za-z][A-Za-z0-9_]*$ </code></pre> which is a typical identifier (must start with alpha and can have any number of alphanumeric and undescore characters following, including none). The following pseudo-code shows how this can be done with a finite state machine: <pre class="prettyprint"><code>state = FIRSTCHAR for char in all_chars_in(string): if state == FIRSTCHAR: if char is not in the set "A-Z" or "a-z": error "Invalid first character" state = SUBSEQUENTCHARS next char if state == SUBSEQUENTCHARS: if char is not in the set "A-Z" or "a-z" or "0-9" or "_": error "Invalid subsequent character" state = SUBSEQUENTCHARS next char </code></pre> Now, as I said, this is a very simple example. It doesn't show how to do greedy/nongreedy matches, backtracking, matching within a line (instead of the whole line) and other more esoteric features of state machines that are easily handled by the RE syntax. That's why REs are so powerful. The actual finite state machine code required to do what a one-liner RE can do is usually very long and complex. The best thing you could do is grab a copy of some lex/yacc (or equivalent) code for a specific simple language and see the code it generates. It's not pretty (it doesn't have to be since it's not supposed to be read by humans, they're supposed to be looking at the lex/yacc code) but it may give you a better idea as to how they work.

Short example of regular expression converted to a state machine?

Tags:

In the Stack Overflow podcast #36 (https://blog.stackoverflow.com/2009/01/podcast-36/), this opinion was expressed: Once you understand how easy it is to set up a state machine, you’ll never try to use a regular expression inappropriately ever again.

I've done a bunch of searching. I've found some academic papers and other complicated examples, but I'd like to find a simple example that would help me understand this process. I use a lot of regular expressions, and I'd like to make sure I never use one "inappropriately" ever again.

241

asked Feb 08 '09 02:02

slothbear

2 Answers

Sure, although you'll need more complicated examples to truly understand how REs work. Consider the following RE:

^[A-Za-z][A-Za-z0-9_]*$

which is a typical identifier (must start with alpha and can have any number of alphanumeric and undescore characters following, including none). The following pseudo-code shows how this can be done with a finite state machine:

state = FIRSTCHAR for char in all_chars_in(string):     if state == FIRSTCHAR:             if char is not in the set "A-Z" or "a-z":                 error "Invalid first character"             state = SUBSEQUENTCHARS             next char     if state == SUBSEQUENTCHARS:             if char is not in the set "A-Z" or "a-z" or "0-9" or "_":                 error "Invalid subsequent character"             state = SUBSEQUENTCHARS             next char

Now, as I said, this is a very simple example. It doesn't show how to do greedy/nongreedy matches, backtracking, matching within a line (instead of the whole line) and other more esoteric features of state machines that are easily handled by the RE syntax.

That's why REs are so powerful. The actual finite state machine code required to do what a one-liner RE can do is usually very long and complex.

The best thing you could do is grab a copy of some lex/yacc (or equivalent) code for a specific simple language and see the code it generates. It's not pretty (it doesn't have to be since it's not supposed to be read by humans, they're supposed to be looking at the lex/yacc code) but it may give you a better idea as to how they work.

152

answered Oct 28 '22 15:10

paxdiablo

A rather convenient way to help look at this to use python's little-known re.DEBUG flag on any pattern:

>>> re.compile(r'<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>', re.DEBUG) literal 60 subpattern 1   in     range (65, 90)   max_repeat 0 65535     in       range (65, 90)       range (48, 57) at at_boundary max_repeat 0 65535   not_literal 62 literal 62 subpattern 2   min_repeat 0 65535     any None literal 60 literal 47 groupref 1 literal 62

The numbers after 'literal' and 'range' refer to the integer values of the ascii characters they're supposed to match.

answered Oct 28 '22 14:10

ʞɔıu

Related questions
                            
                                SQLAlchemy and empty columns
                            
                                How to call setUndecorated() after a frame is made visible?
                            
                                Check if NSURL returns 404
                            
                                number_format() with MySQL
                            
                                Drag & Drop Reorder Rows on NSTableView
                            
                                How To Join Relative URLs in JavaScript
                            
                                Unable to send an email to multiple addresses/recipients using C#
                            
                                android.content.ActivityNotFoundException:
                            
                                Convert Java Date into XML Date Format (and vice versa)
                            
                                Return HTML from a user-selected text
                            
                                How to write conditional migrations in rails?
                            
                                Inserting values to SQLite table in Android

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With