I'm surprised that I can't match a German umlaut in a regexp. I tried several approaches, most of which involved setting locales, but so far to no avail.
import locale
import re
locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')
re.findall(r'\w+', 'abc def g\xfci jkl', re.L)
re.findall(r'\w+', 'abc def g\xc3\xbci jkl', re.L)
re.findall(r'\w+', 'abc def güi jkl', re.L)
re.findall(r'\w+', u'abc def güi jkl', re.L)
None of these versions matches the umlaut-u (ü) correctly with \w+. Also removing the re.L flag or prefixing the pattern string with u (to make it unicode) did not help me.
Any ideas? How is the flag re.L used correctly?
Have you tried to use the re.UNICODE flag, as described in the doc?
>>> re.findall(r'\w+', 'abc def güi jkl', re.UNICODE)
['abc', 'def', 'g\xc3\xbci', 'jkl']
A quick search points to this thread that gives some explanation:
re.LOCALE just passes the character to the underlying C library. It really only works on bytestrings which have 1 byte per character. UTF-8 encodes codepoints outside the ASCII range to multiple bytes per codepoint, and the re module will treat each of those bytes as a separate character.
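To make the byte-versus-character point concrete, here is a minimal sketch rather than a definitive fix. It assumes the input arrives as UTF-8 encoded bytes; the variable names (raw, text, byte_words, words) are made up for illustration. Matching on the raw bytes splits the word at the two-byte umlaut, while decoding first and matching with re.UNICODE keeps it whole.

```python
# -*- coding: utf-8 -*-
import re

# Assumption: the input is a UTF-8 encoded byte string.
raw = b'abc def g\xc3\xbci jkl'

# Without any flag, \w on bytes only knows ASCII word characters,
# so the two bytes that encode "ü" act as a separator.
byte_words = re.findall(br'\w+', raw)

# Decode to a Unicode string first, then match with re.UNICODE.
# (On Python 3 this flag is already the default for str patterns.)
text = raw.decode('utf-8')
words = re.findall(r'\w+', text, re.UNICODE)

print(byte_words)
print(words)
```

With this approach no locale needs to be set, because the Unicode character database decides what counts as a word character.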