After getting help with a question I asked yesterday ("Python 2.7 - find and replace from text file, using dictionary, to new text file"), I started learning regular expressions today so that I could understand the code @Blckknght kindly wrote for me in his answer.
However, it seems to me that the Python docs are (or, more likely, I am) slightly incorrect regarding the \b code. The section I am referring to in the Python docs regarding \b is this:
For example, r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
(Link to the page http://docs.python.org/2/library/re.html )
I cannot understand how 'bar foo baz' is a match. For example, if I create this code:
import re
m = re.search(r'\bfoo\b', 'bar foo baz')
m.group()
...then I get this result from the console:
'foo'
... and not
'bar foo baz'
In fact, based on the rest of the explanation about '\b' in the Python docs, I would actually expect 'foo' to print to the console, since it matches the empty string at the beginning and the end of a word.
So, what is the deal in the Python docs that 'bar foo baz' is a match?
Edit: I am using python 2.7
"I would actually expect 'foo' to print to the console since it matches the empty string at the beginning and the end of a word."
Did you mean to write ' foo ', with a space on each end? It doesn't capture the spaces because \b matches transitions, the gaps between characters, not characters themselves.
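A quick sketch may help separate the two ideas (the variable names are only for illustration). The docs sentence lists strings in which the pattern is found somewhere, while m.group() returns only the text the pattern consumed, and \b itself consumes nothing:

```python
import re

pattern = re.compile(r'\bfoo\b')

# Strings taken from the docs sentence: the pattern should be found somewhere
# inside the first four, and nowhere inside the last two.
found_in = ['foo', 'foo.', '(foo)', 'bar foo baz']
not_found_in = ['foobar', 'foo3']

results = {s: pattern.search(s) is not None for s in found_in + not_found_in}

m = pattern.search('bar foo baz')
matched_text = m.group()   # only the characters consumed by the pattern
matched_span = m.span()    # (start, end) offsets inside 'bar foo baz'

# \b on its own matches an empty string: it sits in a gap between characters.
boundary_text = re.search(r'\b', 'foo').group()

print(results)
print(matched_text)
print(matched_span)
print(repr(boundary_text))
```

So "'bar foo baz' is a match" in the docs can be read as "the pattern matches somewhere in that string", not "the whole string is returned".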
The regex system treats strings like a stream of "tokens", where there is not a 1:1 relationship between a token and a character in a text file. Expressions like \bfoo\b are simply a super short way to write rules for a Pac-Man-like robot which travels along eating things.
For example, suppose we have foo b4r b@z. The token-stream might be something like:
misc : start_of_string
misc : word_boundary
letter : 'f'
letter : 'o'
letter : 'o'
misc : word_boundary
wspace : ' '
misc : word_boundary
letter : 'b'
number : '4'
letter : 'r'
misc : word_boundary
wspace : ' '
misc : word_boundary
letter : 'b'
misc : word_boundary
char : '@'
misc : word_boundary
letter : 'z'
misc : word_boundary
misc : end_of_string
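The word_boundary entries in that stream can be checked against the real engine. This is only a small sketch: it asks re.finditer for every place where a bare \b matches and prints the text with a '|' dropped into each gap. The list of positions should line up with the word_boundary entries above.

```python
import re

text = 'foo b4r b@z'

# Each match of r'\b' is an empty string; only its position is interesting.
boundaries = [m.start() for m in re.finditer(r'\b', text)]

print(boundaries)
for pos in boundaries:
    # Show the text with a '|' marker placed in the gap at this boundary.
    print(text[:pos] + '|' + text[pos:])
```

Note that there is no boundary around the 4 in b4r (digits count as word characters), while @ gets a boundary on both sides.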
When you do re.search(r'\bfoo\b', str), that eventually becomes a set of rules for Pac-Man to follow, roughly like:
1. Start eating things until you find misc:word_boundary.
2. Eat misc:word_boundary and remember your current position as $N.
3. Try to eat letter:'f'. If you can't, spit everything up, travel to $N+1, and go to rule #2.
4. Try to eat letter:'o'. If you can't, spit everything up, travel to $N+1, and go to rule #2.
5. Try to eat letter:'o'. If you can't, spit everything up, travel to $N+1, and go to rule #2.
6. Try to eat misc:word_boundary. If you can't, spit everything up, travel to $N+1, and go to rule #2.

Obviously there's a lot more complexity you can layer on there, such as with loops (+*?) and shorthand (like \w for "a or b or c or ...") or how it selectively ignores some tokens, but hopefully the basic style is revealed.
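The rules above can be sketched as a tiny hand-written matcher. This is a hypothetical, simplified stand-in for the one pattern discussed here, not how the re module is actually implemented. The helper names (is_word_char, at_word_boundary, pacman_search) are made up for illustration, and is_word_char only approximates what \w covers.

```python
import re

def is_word_char(ch):
    # Assumption: letters, digits and underscore, roughly what \w covers.
    return ch.isalnum() or ch == '_'

def at_word_boundary(text, pos):
    # A boundary is a gap with a word character on exactly one side.
    before = pos > 0 and is_word_char(text[pos - 1])
    after = pos < len(text) and is_word_char(text[pos])
    return before != after

def pacman_search(text, word='foo'):
    # Try every starting gap in turn: the "travel to $N+1" step on failure.
    for start in range(len(text) + 1):
        if not at_word_boundary(text, start):   # leading word_boundary
            continue
        end = start + len(word)
        if text[start:end] != word:             # letters 'f', 'o', 'o' in order
            continue
        if at_word_boundary(text, end):         # trailing word_boundary
            return (start, end)
    return None

# Compare the sketch with the real engine on the strings from the docs.
for sample in ['foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3']:
    m = re.search(r'\bfoo\b', sample)
    expected = m.span() if m else None
    print(repr(sample))
    print(pacman_search(sample))
    print(expected)
```

For each sample, the two printed results should agree: a (start, end) pair where foo stands alone as a word, and None for 'foobar' and 'foo3'.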
As for parsing XML with regular expressions: short answer? No. Pac-man only operates on lines of stuff, but XML is like a tree. Pac-man would have to stop at certain points and hire some pac-men to explore for him (with their own different set of rules) and report back. Those sub-contractors would have sub-sub-contractors of their own too...
Anyway, Pac-man's people-skills are stunted after living in an inescapable maze full of deadly ghosts and performance-enhancing drugs. You can't get very far in a Pac-Corp when all you can say is Wakka.