I wrote a program to add (limited) unicode support to Python regexes, and while it's working fine on CPython 2.5.2 it's not working on PyPy (1.5.0-alpha0 1.8.0, implementing Python 2.7.1 2.7.2), both running on Windows XP (Edit: as seen in the comments, @dbaupp could run it fine on Linux). I have no idea why, but I suspect it has something to do with my uses of u"
and ur"
. The full source is here, and the relevant bits are:
# -*- coding:utf-8 -*-
import re
# Regexps to match characters in the BMP according to their Unicode category.
# Extracted from Unicode specification, version 5.0.0, source:
# http://unicode.org/versions/Unicode5.0.0/
unicode_categories = {
ur'Pi':ur'[\u00ab\u2018\u201b\u201c\u201f\u2039\u2e02\u2e04\u2e09\u2e0c\u2e1c]',
ur'Sk':ur'[\u005e\u0060\u00a8\u00af\u00b4\u00b8\u02c2-\u02c5\u02d2-\u02df\u02...',
ur'Sm':ur'[\u002b\u003c-\u003e\u007c\u007e\u00ac\u00b1\u00d7\u00f7\u03f6\u204...',
...
ur'Pf':ur'[\u00bb\u2019\u201d\u203a\u2e03\u2e05\u2e0a\u2e0d\u2e1d]',
ur'Me':ur'[\u0488\u0489\u06de\u20dd-\u20e0\u20e2-\u20e4]',
ur'Mc':ur'[\u0903\u093e-\u0940\u0949-\u094c\u0982\u0983\u09be-\u09c0\u09c7\u0...',
}
def hack_regexp(regexp_string):
for (k,v) in unicode_categories.items():
regexp_string = regexp_string.replace((ur'\p{%s}' % k),v)
return regexp_string
def regex(regexp_string,flags=0):
"""Shortcut for re.compile that also translates and add the UNICODE flag
Example usage:
>>> from unicode_hack import regex
>>> result = regex(ur'^\p{Ll}\p{L}*').match(u'áÇñ123')
>>> print result.group(0)
áÇñ
>>>
"""
return re.compile(hack_regexp(regexp_string), flags | re.UNICODE)
(on PyPy there is no match in the "Example usage", so result
is None
)
Reiterating, the program works fine (on CPython): the Unicode data seems correct, the replace works as intended, the usage example runs ok (both via doctest
and directly typing it in the command line). The source file encoding is also correct, and the coding
directive in the header seems to be recognized by Python.
Any ideas of what PyPy does "different" that is breaking my code? Many things came to my head (unrecognized coding
header, different encodings in the command line, different interpretations of r
and u
) but as far as my tests go, both CPython and PyPy seems to behave identically, so I'm clueless about what to try next.
Unicode Regular Expressions. Unicode is a character set that aims to define all characters and glyphs from all human languages, living and dead. With more and more software being required to support multiple languages, or even just any language, Unicode has been strongly gaining popularity in recent years.
A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. RegEx can be used to check if a string contains the specified search pattern.
Type "cmd" in the search bar and hit Enter to open the command line. What is this? Type “ pip install regex ” (without quotes) in the command line and hit Enter again. This installs regex for your default Python installation.
Why aren’t you simply using Matthew Barnett’s super-recommended regexp
module instead?
It works on both Python 3 and legacy Python 2, is a drop-in replacement for re
, handles all the Unicode stuff you could want, and a whole lot more.
Seems PyPy has some encoding problems, both when reading a source file (unrecognized coding
header, maybe) and when inputting/outputting in the command line. I replaced my example code with the following:
>>> from unicode_hack import regex
>>> result = regex(ur'^\p{Ll}\p{L}*').match(u'áÇñ123')
>>> print result.group(0) == u'áÇñ'
True
>>>
And it kept working on CPython and failing on PyPy. Replacing the "áÇñ" for its escaped characters - u'\xe1\xc7\xf1'
- OTOH did the trick:
>>> from unicode_hack import regex
>>> result = regex(ur'^\p{Ll}\p{L}*').match(u'\xe1\xc7\xf1123')
>>> print result.group(0) == u'\xe1\xc7\xf1'
True
>>>
That worked fine on both. I believe the problem is restricted to these two scenarios (source loading and command line), since trying to open an UTF-8 file using codecs.open
works fine. When I try to input the string "áÇñ" in the command line, or when I load the source code of "unicode_hack.py" using codecs
, I get the same result on CPython:
>>> u'áÇñ'
u'\xe1\xc7\xf1'
>>> import codecs
>>> codecs.open('unicode_hack.py','r','utf8').read()[19171:19174]
u'\xe1\xc7\xf1'
but different results on PyPy:
>>>> u'áÇñ'
u'\xa0\u20ac\xa4'
>>>> import codecs
>>>> codecs.open('unicode_hack.py','r','utf8').read()[19171:19174]
u'\xe1\xc7\xf1'
Update: Issue1139 submitted on PyPy bug tracking system, let's see how that turns out...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With