Writing unicode regex for both Python2 and Python3

Question

I can use the ur'something' and the re.U flag in Python2 to compile a regex pattern, e.g.:

$ python2
Python 2.7.13 (default, Dec 18 2016, 07:03:39) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> pattern = re.compile(ur'(«)', re.U)
>>> s = u'«abc «def«'
>>> re.sub(pattern, r' \1 ', s)
u' \xab abc  \xab def \xab '
>>> print re.sub(pattern, r' \1 ', s)
 « abc  « def «

In Python3, I can avoid the u'something' and even the re.U flag:

$ python3
Python 3.5.2 (default, Oct 11 2016, 04:59:56) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> pattern = re.compile(r'(«)')
>>> s = u'«abc «def«'
>>> print( re.sub(pattern, r' \1 ', s))
 « abc  « def «

But the goal is to write the regex such that it supports both Python2 and Python3. And doing ur'something' in Python3 would result in a syntax error:

>>> pattern = re.compile(ur'(«)', re.U)
  File "<stdin>", line 1
    pattern = re.compile(ur'(«)', re.U)
                               ^
SyntaxError: invalid syntax

Since it's a syntax error, even checking versions before declaring the pattern wouldn't work in Python3:

>>> import sys
>>> _pattern = r'(«)' if sys.version_info[0] == 3 else ur'(«)'
  File "<stdin>", line 1
    _pattern = r'(«)' if sys.version_info[0] == 3 else ur'(«)'
                                                             ^
SyntaxError: invalid syntax

How can I write a unicode regex that supports both Python2 and Python3?

In this particular case, r' ' could easily be replaced by u' ' by dropping the raw string.

But there are complicated regexes that more or less require r' ' for sanity's sake, e.g.

re.sub(re.compile(r'([^\.])(\.)([\]\)}>"\'»]*)\s*$', re.U), r'\1 \2\3 ', s)

So the solution should keep the raw string r' ' usage unless there are other ways to get around it. Note that using from __future__ import unicode_literals is undesired, since it would cause tonnes of other problems, especially in other parts of the code base that I work with; see http://python-future.org/unicode_literals.html

The specific reason the code base discourages the unicode_literals import but uses the r' ' notation is that it is filled with raw strings, and changing each one of them would be extremely painful, e.g.

https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py
https://github.com/nltk/nltk/blob/develop/nltk/tokenize/moses.py

cco · Accepted Answer

Do you really need raw strings? For your example, a unicode string is needed, but not a raw string. Raw strings are a convenience, not a requirement: just double any \ you would use in the raw string and write a plain unicode literal (u' ') instead.
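As a rough sketch of the doubled-backslash approach, here is the longer pattern from the question written as an ordinary u' ' literal. The names PATTERN and pad_final_period are made up for illustration, and the coding line assumes the file is saved as UTF-8 (Python 2 needs that declaration to read the » character).

```python
# -*- coding: utf-8 -*-
import re

# Non-raw unicode literal: every regex backslash from the raw version is doubled.
# The \' only escapes the quote for Python, so the regex sees a bare ' there,
# which matches the same character as \' did in the raw version.
# PATTERN and pad_final_period are hypothetical names, not from the question.
PATTERN = re.compile(u'([^\\.])(\\.)([\\]\\)}>"\'»]*)\\s*$', re.U)


def pad_final_period(text):
    # The replacement is non-raw as well, so its backslashes are doubled too.
    return PATTERN.sub(u'\\1 \\2\\3 ', text)


print(pad_final_period(u'He said «hello.»') == u'He said «hello .» ')
```

The cost is readability: each regex escape now takes two backslashes, which is exactly what raw strings are meant to avoid.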

Python 2 allows concatenating a raw string literal with a unicode literal (resulting in a unicode string), so you could use r'([^\.])(\.)([\]\)}>"\'' u'»' r']*)\s*$'
In Python 3 (3.3 and later, where the u' ' prefix is accepted again), all three pieces are unicode strings, so that will work too.
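A minimal sketch of the concatenation trick, assuming a UTF-8 source file and keeping the raw pieces ASCII-only (Python 2 has to convert those byte strings to unicode when joining them). The names PATTERN and SIMPLE are illustrative only.

```python
# -*- coding: utf-8 -*-
import re

# Adjacent string literals are joined by the compiler, so the regex escapes
# stay in raw pieces and only the non-ASCII character sits in a u'' literal.
# PATTERN and SIMPLE are hypothetical names, not from the question.
PATTERN = re.compile(
    r'([^\.])(\.)([\]\)}>"\''  # raw, ASCII only
    u'»'                        # the one non-ASCII character
    r']*)\s*$',                 # raw, ASCII only
    re.U,
)

# The simpler pattern from the question, built the same way.
SIMPLE = re.compile(r'(' u'«' r')', re.U)

print(PATTERN.sub(r'\1 \2\3 ', u'He said «hello.»') == u'He said «hello .» ')
print(SIMPLE.sub(r' \1 ', u'«abc «def«') == u' « abc  « def « ')
```

This keeps the existing r' ' pieces untouched apart from splitting them around the non-ASCII character, so no unicode_literals import is needed.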
