
Can the [a-zA-Z] Python regex pattern be made to match and replace non-ASCII Unicode characters?

In the following examples, I would like every letter in the string to be replaced with 'X', but the 'ä' is left untouched.

In Python 2.7:

>>> import re
>>> re.sub(u"[a-zA-Z]","X","dfäg")
'XX\xc3\xa4X'

or

>>> re.sub("[a-zA-Z]","X","dfäg",re.UNICODE)
u'XX\xe4X'

In Python 3.4:

>>> re.sub("[a-zA-Z]","X","dfäg")
'XXäX'

Is it possible to somehow 'configure' the [a-zA-Z] pattern to match 'ä', 'ü', etc.? If this can't be done, how can I create a similar character range pattern between square brackets that would include Unicode characters in the usual 'full alphabet' range? I mean, in a language like German, for instance, 'ä' would be placed somewhere close to 'a' in the alphabet, so one would expect it to be included in the 'a-z' range.

X-Mann asked Oct 14 '15 14:10



1 Answer

You may use either of these patterns:

(?![\d_])\w
[^\W\d_]

In Python 2.x, the re.U / re.UNICODE flag is required. The (?![\d_]) lookahead restricts the \w shorthand class so that it cannot match digits (\d) or underscores. The [^\W\d_] pattern matches any word character other than digits and underscores.
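
To see the digit and underscore exclusion in action, here is a minimal Python 3 sketch. The sample string and the variable names are made up for illustration; the non-ASCII letters are written as escapes so the result does not depend on how the source file is encoded or normalized.

```python
import re

# The two patterns from the answer. In Python 3, str patterns are
# Unicode-aware by default, so no flag is needed.
LETTER_LOOKAHEAD = re.compile(r"(?![\d_])\w")
LETTER_NEGATED = re.compile(r"[^\W\d_]")

# Hypothetical sample: "dfäg_42 Straße" with ä and ß as escapes.
sample = "df\u00e4g_42 Stra\u00dfe"

masked_lookahead = LETTER_LOOKAHEAD.sub("X", sample)
masked_negated = LETTER_NEGATED.sub("X", sample)

print(masked_lookahead)  # => XXXX_42 XXXXXX
print(masked_negated)    # => XXXX_42 XXXXXX
```

Letters are replaced, while the underscore, the digits and the space are left alone, and both patterns give the same result.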


A Python 3 demo:

import re
print(re.sub(r"(?![\d_])\w", "X", "dfäg"))
# => XXXX

print(re.sub(r"[^\W\d_]", "X", "dfäg"))
# => XXXX

As for Python 2:

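# -*- coding: utf-8 -*-
import re
s = "dfäg"  # a UTF-8 encoded byte string in Python 2
# Decode to unicode before matching, then encode the result back to UTF-8.
w = re.sub(ur'(?![\d_])\w', u'X', s.decode('utf8'), flags=re.UNICODE).encode("utf8")
print(w)
# => XXXX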
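
As for why [a-zA-Z] cannot be "configured" to include 'ä': a bracket range in re is compared by code point, not by alphabetical order in any language, and 'ä' (U+00E4) lies outside U+0061 to U+007A. A small Python 3 sketch of that check, with illustrative variable names:

```python
import re

A_UMLAUT = "\u00e4"  # 'ä' as a single precomposed code point

# The range a-z spans code points 0x61..0x7a; 'ä' is 0xe4.
print(hex(ord("a")), hex(ord("z")), hex(ord(A_UMLAUT)))  # => 0x61 0x7a 0xe4

in_ascii_range = ord("a") <= ord(A_UMLAUT) <= ord("z")
matches_range = re.fullmatch(r"[a-zA-Z]", A_UMLAUT) is not None
matches_letter = re.fullmatch(r"[^\W\d_]", A_UMLAUT) is not None

print(in_ascii_range, matches_range, matches_letter)  # => False False True
```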
Wiktor Stribiżew answered Nov 15 '22 06:11
