Unicode, regular expressions and PyPy

Tags: python, string, regex, unicode, pypy

I wrote a program that adds (limited) Unicode support to Python regexes. It works fine on CPython 2.5.2 but not on PyPy (~~1.5.0-alpha0~~ 1.8.0, implementing Python ~~2.7.1~~ 2.7.2), both running on Windows XP (Edit: as seen in the comments, @dbaupp could run it fine on Linux). I have no idea why, but I suspect it has something to do with my use of the u and ur string-literal prefixes. The full source is here, and the relevant bits are:

# -*- coding:utf-8 -*-
import re

# Regexps to match characters in the BMP according to their Unicode category.
# Extracted from Unicode specification, version 5.0.0, source:
# http://unicode.org/versions/Unicode5.0.0/
unicode_categories = {
    ur'Pi':ur'[\u00ab\u2018\u201b\u201c\u201f\u2039\u2e02\u2e04\u2e09\u2e0c\u2e1c]',
    ur'Sk':ur'[\u005e\u0060\u00a8\u00af\u00b4\u00b8\u02c2-\u02c5\u02d2-\u02df\u02...',
    ur'Sm':ur'[\u002b\u003c-\u003e\u007c\u007e\u00ac\u00b1\u00d7\u00f7\u03f6\u204...',
    ...
    ur'Pf':ur'[\u00bb\u2019\u201d\u203a\u2e03\u2e05\u2e0a\u2e0d\u2e1d]',
    ur'Me':ur'[\u0488\u0489\u06de\u20dd-\u20e0\u20e2-\u20e4]',
    ur'Mc':ur'[\u0903\u093e-\u0940\u0949-\u094c\u0982\u0983\u09be-\u09c0\u09c7\u0...',
}

def hack_regexp(regexp_string):
    for (k,v) in unicode_categories.items():
        regexp_string = regexp_string.replace((ur'\p{%s}' % k),v)
    return regexp_string

def regex(regexp_string,flags=0):
    """Shortcut for re.compile that also translates \\p{..} categories and adds the UNICODE flag

    Example usage:
        >>> from unicode_hack import regex
        >>> result = regex(ur'^\p{Ll}\p{L}*').match(u'áÇñ123')
        >>> print result.group(0)
        áÇñ
        >>> 
    """
    return re.compile(hack_regexp(regexp_string), flags | re.UNICODE)

(On PyPy the "Example usage" produces no match, so result is None.)
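A reduced version of the same technique, written only with escapes so that no non-ASCII byte appears in the source or on the console, may help separate the regex engine from any encoding issue. This is a sketch, not the full program: the two-entry table and the name expand_categories are illustrative and do not come from the full source.

```python
# -*- coding: utf-8 -*-
import re

# Illustrative two-entry table (hypothetical subset, not the full Unicode data).
categories = {
    u'Pi': u'[\u00ab\u2018\u201c]',   # a few initial quotation marks
    u'Pf': u'[\u00bb\u2019\u201d]',   # the matching final quotation marks
}

def expand_categories(pattern):
    """Replace each \\p{Name} token with its explicit character class."""
    for name, char_class in categories.items():
        pattern = pattern.replace(u'\\p{%s}' % name, char_class)
    return pattern

compiled = re.compile(expand_categories(u'^\\p{Pi}(\\w+)\\p{Pf}$'), re.UNICODE)
m = compiled.match(u'\u00abbonjour\u00bb')   # the word wrapped in guillemets
print(m is not None)
```

If this matches on both interpreters while the version with literal "áÇñ" does not, the difference is more likely in how the literal is decoded than in the regex engine.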

Reiterating, the program works fine (on CPython): the Unicode data seems correct, the replace works as intended, the usage example runs ok (both via doctest and directly typing it in the command line). The source file encoding is also correct, and the coding directive in the header seems to be recognized by Python.

Any ideas what PyPy does differently that breaks my code? Many things came to mind (unrecognized coding header, different encodings in the command line, different interpretations of r and u), but as far as my tests go, CPython and PyPy seem to behave identically, so I'm clueless about what to try next.
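One way to check the "different encodings" suspicion is to print what each interpreter reports about its own encodings and compare the output side by side. The helper name encoding_report is made up for this sketch; the sys calls it uses are standard.

```python
import sys

def encoding_report():
    """Collect the encodings this interpreter reports for itself."""
    return {
        'version': sys.version.split()[0],
        'default': sys.getdefaultencoding(),
        'filesystem': sys.getfilesystemencoding(),
        # stdin/stdout may be missing or lack an encoding when redirected
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
    }

for name, value in sorted(encoding_report().items()):
    print('%-10s %s' % (name, value))
```

Running the same script under CPython and PyPy in the same console should show whether the two disagree about the console encoding.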

asked May 06 '12 13:05

mgibsonbr

2 Answers

Why aren’t you simply using Matthew Barnett’s super-recommended regex module instead?

It works on both Python 3 and legacy Python 2, is a drop-in replacement for re, handles all the Unicode stuff you could want, and a whole lot more.
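As a rough sketch of what that looks like for the pattern in the question, assuming the third-party regex package is installed (for example via pip install regex); the helper name leading_letters is made up for illustration:

```python
# -*- coding: utf-8 -*-
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None  # lets the sketch load even where the package is absent

def leading_letters(text):
    """Return the leading run matching \\p{Ll}\\p{L}*, or None."""
    if regex is None:
        return None
    m = regex.match(u'^\\p{Ll}\\p{L}*', text, flags=regex.UNICODE)
    return m.group(0) if m else None

# Escaped form of the question's test string, to sidestep console encoding.
print(repr(leading_letters(u'\xe1\xc7\xf1123')))
```

The \p{..} property classes are handled by the module itself, so no hand-built category table is needed.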

answered Oct 20 '22 01:10

tchrist

It seems PyPy has some encoding problems, both when reading a source file (an unrecognized coding header, maybe) and when reading input or writing output on the command line. I replaced my example code with the following:

>>> from unicode_hack import regex
>>> result = regex(ur'^\p{Ll}\p{L}*').match(u'áÇñ123')
>>> print result.group(0) == u'áÇñ'
True
>>>

It kept working on CPython and failing on PyPy. Replacing "áÇñ" with its escaped form, u'\xe1\xc7\xf1', on the other hand, did the trick:

>>> from unicode_hack import regex
>>> result = regex(ur'^\p{Ll}\p{L}*').match(u'\xe1\xc7\xf1123')
>>> print result.group(0) == u'\xe1\xc7\xf1'
True
>>>

That worked fine on both. I believe the problem is restricted to these two scenarios (source loading and command line), since opening a UTF-8 file with codecs.open works fine. When I type the string "áÇñ" on the command line, or when I load the source code of "unicode_hack.py" using codecs, I get the same result on CPython:

>>> u'áÇñ'
u'\xe1\xc7\xf1'
>>> import codecs
>>> codecs.open('unicode_hack.py','r','utf8').read()[19171:19174]
u'\xe1\xc7\xf1'

but different results on PyPy:

>>>> u'áÇñ'
u'\xa0\u20ac\xa4'
>>>> import codecs
>>>> codecs.open('unicode_hack.py','r','utf8').read()[19171:19174]
u'\xe1\xc7\xf1'
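One possible reading of the PyPy console result, offered as a hypothesis and not a confirmed diagnosis: the three code points are what you get if the console sends the text as cp850 bytes and the interpreter decodes them as cp1252. Both code pages are assumptions about this particular Windows XP setup. The round trip can be reproduced on any interpreter:

```python
# -*- coding: utf-8 -*-
# Hypothesis only: console bytes arrive in cp850 but are decoded as cp1252.
typed = u'\xe1\xc7\xf1'                      # the three letters, written with escapes
console_bytes = typed.encode('cp850')        # bytes an OEM cp850 console would send
misdecoded = console_bytes.decode('cp1252')  # what a cp1252 decoder makes of them

print(repr(console_bytes))
print(misdecoded == u'\xa0\u20ac\xa4')
```

The misdecoded value equals u'\xa0\u20ac\xa4', the same string PyPy printed, which is consistent with a console code page mismatch but does not prove it.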

Update: Issue1139 has been submitted to the PyPy bug tracker; let's see how that turns out...

answered Oct 19 '22 23:10

mgibsonbr
