Python Docs Wrong About Regular Expression "\b"?

Tags: python, regex

After getting help with a question I asked yesterday ("Python 2.7 - find and replace from text file, using dictionary, to new text file"), I started learning regular expressions today so that I could understand the code @Blckknght kindly wrote for me in his answer.

However, it seems to me that the Python docs are (or, more likely, I am) slightly incorrect regarding the \b code. The section of the Python docs on \b that I am referring to is this:

For example, r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

(Link to the page: http://docs.python.org/2/library/re.html)

I cannot understand how 'bar foo baz' is a match. For example, if I run this code:

import re

m = re.search(r'\bfoo\b', 'bar foo baz')
m.group()

...then I get this result from the console:

'foo'

... and not

'bar foo baz'

In fact, based on the rest of the explanation about '\b' in the Python docs, I would actually expect 'foo' to print to the console since it matches the empty string at the beginning and the end of a word.

So why do the Python docs say that 'bar foo baz' is a match?

Edit: I am using Python 2.7.

Darren Haynes asked Sep 18 '13 01:09


1 Answer

> I would actually expect 'foo' to print to the console since it matches the empty string at the beginning and the end of a word.

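Did you mean to write ' foo ', with a space on each end? The match doesn't capture the spaces because \b matches transitions, the gaps between characters, not the characters themselves. When the docs say that r'\bfoo\b' matches 'bar foo baz', they mean that the pattern finds a match somewhere inside that string; re.search still returns only the matched portion, 'foo'.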
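A minimal sketch of that point, using only the standard re module (written so it should run on Python 2.7 or 3). The variable names are just for illustration. The span of the match covers only the three letters, and \b on its own matches an empty string at a position:

```python
import re

text = 'bar foo baz'
m = re.search(r'\bfoo\b', text)

# The match covers only the letters; the surrounding spaces are not consumed.
print(m.group())   # foo
print(m.span())    # (4, 7)

# \b is zero-width: it matches a position, so the matched text is empty.
boundary = re.search(r'\b', text)
print(repr(boundary.group()))  # ''
print(boundary.span())         # (0, 0)
```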


Some ramblings on the way regex works

A useful mental model is that the regex system treats a string as a stream of "tokens", where there is not a 1:1 relationship between a token and a character in a text file. Expressions like \bfoo\b are simply a very short way to write rules for a Pac-Man-like robot which travels along the stream eating things.

For example, suppose we have the string 'foo b4r b@z'. The token stream might be something like:

misc    :  start_of_string
misc    :  word_boundary
letter  :  'f'
letter  :  'o'
letter  :  'o'
misc    :  word_boundary
wspace  :  ' '
misc    :  word_boundary
letter  :  'b'
number  :  '4'
letter  :  'r'
misc    :  word_boundary
wspace  :  ' '
misc    :  word_boundary
letter  :  'b'
misc    :  word_boundary
char    :  '@'
misc    :  word_boundary
letter  :  'z'
misc    :  word_boundary
misc    :  end_of_string
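The word_boundary entries in that listing can be checked against the real engine. This is a small sketch, not part of the original explanation: it asks re.finditer for every place where \b matches in the same example string and prints the index of each gap.

```python
import re

s = 'foo b4r b@z'

# Each \b match is empty, so start() gives the index of the gap it sits in.
boundaries = [m.start() for m in re.finditer(r'\b', s)]
print(boundaries)  # [0, 3, 4, 7, 8, 9, 10, 11]
```

That is eight positions, one for each word_boundary line in the token stream. Note that there is no boundary inside 'b4r', because digits count as word characters.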

When you do re.search(r'\bfoo\b', str), that eventually becomes a set of rules for Pac-Man to follow, roughly like:

  1. Start at the beginning.
  2. Ignore things until you find a misc:word_boundary.
  3. Eat the misc:word_boundary and remember your current position as $N.
  4. Try to eat a letter:'f'. If you can't, spit everything up, travel to $N+1, and go to rule #2.
  5. Try to eat a letter:'o'. If you can't, spit everything up, travel to $N+1, and go to rule #2.
  6. Try to eat a letter:'o'. If you can't, spit everything up, travel to $N+1, and go to rule #2.
  7. Try to eat a misc:word_boundary. If you can't, spit everything up, travel to $N+1, and go to rule #2.
  8. Tell me what's in your stomach now.
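The rules above can be sketched as a tiny hand-written matcher. This is only an illustration of the Pac-Man model, not how the re module is actually implemented; the function names (is_word_char, at_word_boundary, search_bfoob) are made up for this example, and is_word_char only approximates \w.

```python
def is_word_char(ch):
    # Approximates \w: letters, digits, underscore.
    return ch.isalnum() or ch == '_'


def at_word_boundary(s, i):
    # True when exactly one side of the gap at index i is a word character.
    before = i > 0 and is_word_char(s[i - 1])
    after = i < len(s) and is_word_char(s[i])
    return before != after


def search_bfoob(s):
    # Hypothetical hand-rolled stand-in for re.search(r'\bfoo\b', s).
    for n in range(len(s) + 1):          # rules 1-2: scan for a start position
        if not at_word_boundary(s, n):   # rule 3: a boundary is required here
            continue
        pos = n
        for letter in 'foo':             # rules 4-6: eat 'f', 'o', 'o'
            if pos < len(s) and s[pos] == letter:
                pos += 1
            else:
                break                    # spit everything up, try the next start
        else:
            if at_word_boundary(s, pos):  # rule 7: closing boundary
                return s[n:pos]           # rule 8: report what was eaten
    return None


print(search_bfoob('bar foo baz'))  # foo
print(search_bfoob('foobar'))       # None
```

The boundaries take up no room in the stomach: they are checked, not consumed, which is why the result is 'foo' and not ' foo '.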

Obviously there's a lot more complexity you can layer on there, such as with loops (+*?) and shorthand (like \w for "a or b or c or ...") or how it selectively ignores some tokens, but hopefully the basic style is revealed.

So... can I parse HTML/XML now?

Short answer? No. Pac-man only operates on lines of stuff, but XML is like a tree. Pac-man would have to stop at certain points and hire some pac-men to explore for him (with their own different set of rules) and report back. Those sub-contractors would have sub-sub-contractors of their own too...

Anyway, Pac-man's people-skills are stunted after living in an inescapable maze full of deadly ghosts and performance-enhancing drugs. You can't get very far in a Pac-Corp when all you can say is Wakka.
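To make the "No" a little more concrete, here is a small sketch with made-up sample markup. A single left-to-right pattern has no way to pair each opening tag with its own closing tag, so nested or repeated elements trip it up:

```python
import re

html = '<div>outer <div>inner</div> tail</div>'

# A non-greedy pattern stops at the first closing tag it meets.
m = re.search(r'<div>(.*?)</div>', html)
print(m.group(1))  # outer <div>inner

# A greedy pattern overshoots instead when there are sibling elements.
siblings = '<div>one</div><div>two</div>'
g = re.search(r'<div>(.*)</div>', siblings)
print(g.group(1))  # one</div><div>two
```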

14 revs answered Oct 28 '22 02:10