
re matches none: Differences in Python implementation of regex?

I'm having some trouble matching a specific pattern using the Python regex library (re). I'm trying to match lines with a number (of up to 3 digits), followed by a collection of words (with no space between the number and the first word), terminated by exactly two spaces. Some examples, with the matching string enclosed in parentheses:

test(58your own becoming )Adapted from Pyramid Text utterance 81.

(46ancestral fires )In Sumerian, a language recently supplanted by

(45lap of God )Ginzberg, Legends of the Bible, p. 1.

(9Island of the Egg )The symbolism of the cosmic egg is an integral aspect of almost every mythological tradition. In the

I'm using the following expression:

(\d+).+(  )

The relevant python code is as follows:

import re

# the search string is `tmp`
pattern = re.compile("(\d+).+(  )")
footnotes = pattern.finditer(tmp)
for footnote in footnotes:
    # do something with each match
    print(footnote.group(0))

When I use a testing site like regexr, all the above examples match exactly as intended. However, python matches none. Is there something simple I'm missing? I've also tried passing the expression to re as a raw string. I can't seem to find anything else to try in the documentation. Any help would be greatly appreciated!

EDIT

The full string can be found here.

At this point, I'm fairly certain it has something to do with how I'm handling the string. If I read from a text file, and execute the following code, the output is empty:

with open("stone.md", "r") as f:
    tmp = f.read()
    pattern = re.compile(r"(\d+).+  ")
    footnotes = pattern.finditer(tmp)
    for footnote in footnotes:
        print(tmp[footnote.start():footnote.end()])

But if I run:

tmp = """test58your own becoming  Adapted from Pyramid Text utterance 81."""
pattern = re.compile(r"(\d+).+  ")
footnotes = pattern.finditer(tmp)
for footnote in footnotes:
    print(tmp[footnote.start():footnote.end()])

I get 58your own becoming

avery_laird asked Feb 05 '26 12:02


1 Answer

You've fallen victim to Unicode homoglyphs.

Your regex contains ASCII space characters (the regular spaces you are used to). However, the full text you are operating on contains non-breaking spaces, written &nbsp; in HTML and U+00A0 in Unicode. A non-breaking space looks exactly like a regular space to the human eye, but it is a different character, so the two literal spaces in your pattern never match it.

Python 3.6.2 (default, Jul 20 2017, 03:52:27) 
[GCC 7.1.1 20170630] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> '  '.encode('ascii')  # two regular ASCII spaces
b'  '
>>> '\xa0\xa0'.encode('ascii')  # two non-breaking spaces (U+00A0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
>>> '\xa0\xa0'.encode('utf-8')
b'\xc2\xa0\xc2\xa0'
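
One way to spot such look-alike characters is to list every non-ASCII character in a string together with its Unicode name. This is only a minimal sketch: the helper name find_non_ascii and the sample line are made up for illustration, and the non-breaking spaces are written as explicit \xa0 escapes so they stay visible.

```python
import unicodedata


def find_non_ascii(text):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    hits = []
    for index, char in enumerate(text):
        if ord(char) > 127:
            # Fall back to "UNKNOWN" for characters without a Unicode name.
            name = unicodedata.name(char, "UNKNOWN")
            hits.append((index, "U+%04X" % ord(char), name))
    return hits


# Hypothetical sample: two non-breaking spaces follow "becoming".
sample = "58your own becoming\xa0\xa0Adapted from Pyramid Text utterance 81."
for hit in find_non_ascii(sample):
    print(hit)
```

Running this on a suspicious line shows at a glance whether the "spaces" are really U+0020 or something else.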

The following regex will give you what you want:

pattern = re.compile(b'(\\d+).+(\xc2\xa0\xc2\xa0)'.decode('utf-8'))

This constructs a bytes object and decodes it as UTF-8, so that re receives a str pattern ending in two U+00A0 characters.

Or, even better, you can use \s. For str patterns in Python 3 it matches any Unicode whitespace character, including U+00A0 (in Python 2 this requires a unicode string and the re.UNICODE flag):

pattern = re.compile(r'(\d+).+(\s\s)')
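
A quick check of the \s approach on a sample line, assuming Python 3 and str patterns. Because \s also matches newlines and tabs, the sketch includes a narrower character class, [ \xa0]{2}, that accepts only regular and non-breaking spaces. That variant is a suggestion, not part of the original answer, and the variable names are illustrative.

```python
import re

# Sample line with two non-breaking spaces written as explicit escapes.
sample = "test58your own becoming\xa0\xa0Adapted from Pyramid Text utterance 81."

ascii_only = re.compile(r"(\d+).+(  )")            # two literal ASCII spaces
any_whitespace = re.compile(r"(\d+).+(\s\s)")      # any two whitespace characters
space_or_nbsp = re.compile(r"(\d+).+([ \xa0]{2})")  # only U+0020 or U+00A0

for label, pattern in [
    ("ascii_only", ascii_only),
    ("any_whitespace", any_whitespace),
    ("space_or_nbsp", space_or_nbsp),
]:
    match = pattern.search(sample)
    # repr() makes the non-breaking spaces visible as \xa0 in the output.
    print(label, repr(match.group(0)) if match else None)
```

The narrower class is useful when the text contains blank lines, since two consecutive newlines would otherwise satisfy \s\s.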

Why, then, did the regex in your question appear to work?

Because browsers render the non-breaking space just like an ASCII space, and copying text from the page puts an ASCII space on the clipboard.

I was only able to discover this once you had disclosed the original text file you were working on. I downloaded the raw file with wget, which preserved the Unicode spaces of the original. That would not have happened had I copy-pasted your large text file from the browser into a file on my local computer.
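
To confirm what a file really contains, independent of how a browser or editor renders it, one option is to read it in binary mode and count the UTF-8 byte sequence for U+00A0. This is a sketch under assumptions: the file is UTF-8 encoded, count_nbsp is a made-up helper name, and a temporary file stands in for the downloaded text.

```python
import os
import tempfile

NBSP_UTF8 = b"\xc2\xa0"  # UTF-8 encoding of U+00A0 (no-break space)


def count_nbsp(path):
    """Count non-breaking spaces in a UTF-8 file by scanning its raw bytes."""
    with open(path, "rb") as f:
        return f.read().count(NBSP_UTF8)


# Stand-in file for illustration; point count_nbsp at the real file instead.
line = "45lap of God\xa0\xa0Ginzberg, Legends of the Bible, p. 1.\n"
with tempfile.NamedTemporaryFile(suffix=".md", delete=False) as tmp_file:
    tmp_file.write(line.encode("utf-8"))
    sample_path = tmp_file.name

nbsp_count = count_nbsp(sample_path)
print(nbsp_count)
os.remove(sample_path)
```

A non-zero count means the pattern needs to account for U+00A0, or the text needs normalizing before matching.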

Wow. This was a really fun puzzle to solve. Thanks for the question.

JoshuaRLi answered Feb 08 '26 03:02


