
Python Unicode regular expression matching failing with some Unicode characters - bug or mistake?

I am attempting to use the re module in Python 2.7.3 with Unicode-encoded Devanagari text. I have added from __future__ import unicode_literals to the top of my code, so all string literals should be unicode objects.

However, I am running into some odd problems with Python's regex matching. For instance, consider this name: "किशोरी". This is a (mis-spelled) name, in Hindi, entered by one of my users. Any Hindi reader would recognise this as a word.

The following returns a match, as it should:

re.search("^[\w\s][\w\s]*","किशोरी",re.UNICODE)

But this does not:

re.search("^[\w\s][\w\s]*$","किशोरी",re.UNICODE)

Some spelunking revealed that only one character in this string, character 0915 (क), is recognised as falling within the \w character class. This is incorrect, as the Unicode Character Database file on "derived core properties" lists other characters (I have not checked all) in this string as alphabetic ones - as indeed they are.

Is this just a bug in Python's implementation? I could get around this by manually defining all the Devanagari alphanumeric characters as a character range, but that would be painful. Or am I doing something wrong?
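A minimal sketch of the character-range workaround mentioned above, assuming Python 3 and assuming that anything in the Devanagari block (U+0900 to U+097F) is acceptable input. The block also contains a few punctuation marks such as the danda (U+0964), so this is broader than "alphanumeric". The names DEVANAGARI_NAME and is_valid_name are hypothetical.

```python
import re

# Assumption: accept any code point in the Devanagari block (U+0900-U+097F),
# plus whatever \w and \s already match. The block also holds some
# punctuation, so narrow the range if that matters for your data.
DEVANAGARI_NAME = re.compile("^[\\w\\s\u0900-\u097F]+$", re.UNICODE)


def is_valid_name(text):
    """Return True if text has only word, whitespace, or Devanagari characters."""
    return DEVANAGARI_NAME.search(text) is not None


name = "\u0915\u093f\u0936\u094b\u0930\u0940"  # किशोरी
print(is_valid_name(name))
print(is_valid_name(name + "!"))
```

Listing the block explicitly sidesteps the question of how \w classifies vowel signs, at the cost of hard-coding a single script.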

ShankarG asked Oct 05 '12 12:10


3 Answers

It is a bug in the re module and it is fixed in the regex module:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import unicodedata
import re
import regex  # $ pip install regex

word = "किशोरी"


def test(re_):
    assert re_.search("^\\w+$", word, flags=re_.UNICODE)

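# General category of each code point: letters (Lo) alternate with spacing marks (Mc)
print([unicodedata.category(cp) for cp in word])
# \X matches one extended grapheme cluster (a user-perceived character)
print(" ".join(ch for ch in regex.findall("\\X", word)))
# regex treats a Latin letter, a vowel sign, and a Devanagari letter alike as \w
assert all(regex.match("\\w$", c) for c in ["a", "\u093f", "\u0915"])

test(regex)  # passes
test(re)  # fails with AssertionError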

The output shows that there are 6 codepoints in "किशोरी", but only 3 user-perceived characters (extended grapheme clusters). It would be wrong to break a word inside a character. Unicode Text Segmentation says:

Word boundaries, line boundaries, and sentence boundaries should not occur within a grapheme cluster: in other words, a grapheme cluster should be an atomic unit with respect to the process of determining these other boundaries.

here and further emphasis is mine

A word boundary \b is defined as a transition from \w to \W (or in reverse) in the docs:

Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string, ...

Therefore either all codepoints that form a single character are \w or they are all \W. In this case "किशोरी" matches ^\w{6}$.
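The difference between 6 code points and 3 user-perceived characters can be illustrated without regex. This is a rough sketch, assuming Python 3 and approximating a grapheme cluster as a base code point plus any combining marks (general category M*) that follow it. Real extended grapheme clusters follow more rules than this, and approximate_clusters is a hypothetical helper, not part of any library.

```python
import unicodedata


def approximate_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Only a rough stand-in for extended grapheme clusters: code points whose
    general category starts with "M" are attached to the preceding code
    point, and the other rules of Unicode Text Segmentation are ignored.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "\u0915\u093f\u0936\u094b\u0930\u0940"  # किशोरी
print(len(word))                        # number of code points
print(len(approximate_clusters(word)))  # number of approximate clusters
```

For anything beyond a quick check, regex.findall("\\X", word) as used in the answer's code is the more faithful option.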


From the docs for \w in Python 2:

If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

in Python 3:

Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore.

From regex docs:

Definition of 'word' character (issue #1693050):

The definition of a 'word' character has been expanded for Unicode. It now conforms to the Unicode specification at http://www.unicode.org/reports/tr29/. This applies to \w, \W, \b and \B.

According to unicode.org, U+093F (DEVANAGARI VOWEL SIGN I) is alnum and alphabetic, so regex is also correct to consider it \w even under definitions that are not based on word boundaries.

jfs answered Oct 29 '22 16:10


From Character Map:

‍ि

U+093F DEVANAGARI VOWEL SIGN I

General Character Properties

In Unicode since: 1.1
Unicode category: Mark, Spacing Combining

So, technically speaking this is not a letter and doesn't fall under \w even with re.UNICODE. You can try using regex with Unicode character properties instead in order to include these sorts of characters.
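Where a property-aware regex engine is not available, a similar check can be approximated with the standard library. This is a minimal sketch, assuming Python 3 and assuming that "word character" means any code point whose general category is a letter (L*), a mark (M*), or a decimal digit (Nd), plus the underscore. That definition is an assumption made for illustration, not what re implements, and is_word_char and is_word are hypothetical names.

```python
import unicodedata


def is_word_char(ch):
    """Approximate a Unicode-aware \\w using general categories."""
    category = unicodedata.category(ch)
    # Letters (L*), combining marks (M*), decimal digits (Nd), underscore.
    return category[0] in "LM" or category == "Nd" or ch == "_"


def is_word(text):
    """Return True if text is non-empty and made only of word characters."""
    return bool(text) and all(is_word_char(ch) for ch in text)


word = "\u0915\u093f\u0936\u094b\u0930\u0940"  # किशोरी
print(is_word(word))
print(is_word_char("\u093f"))  # DEVANAGARI VOWEL SIGN I, category Mc
```

Including the mark categories is what lets the vowel signs through; dropping "M" from the check gives the letters-only behaviour described above.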

Ignacio Vazquez-Abrams answered Oct 29 '22 16:10


I tested the following:

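# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import unicodedata

# Print the general category and the name of each code point
for c in "किशोरी":
    print(unicodedata.category(c))
    print(unicodedata.name(c))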

which displays in my case:

Lo
DEVANAGARI LETTER KA
Mc
DEVANAGARI VOWEL SIGN I
Lo
DEVANAGARI LETTER SHA
Mc
DEVANAGARI VOWEL SIGN O
Lo
DEVANAGARI LETTER RA
Mc
DEVANAGARI VOWEL SIGN II

Unicode issues are hard to debug because copying and pasting can corrupt the data, and I don't know Hindi. In some languages, though, the same character can be encoded in more than one way in Unicode. Is it possible that you have to normalize your string before matching? To me it looks OK that a vowel sign is not matched by \w.
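One way to test the normalization idea is to compare the string with each of its normalized forms. This is a small sketch, assuming Python 3, and normalization_changes is a hypothetical helper name. For this particular word I would expect no form to change anything, since these code points have no canonical or compatibility decompositions, in which case normalization alone would not affect what \w matches.

```python
import unicodedata


def normalization_changes(text):
    """Map each normalization form to True if it alters the text."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {form: unicodedata.normalize(form, text) != text for form in forms}


word = "\u0915\u093f\u0936\u094b\u0930\u0940"  # किशोरी
for form, changed in sorted(normalization_changes(word).items()):
    print(form, "changed" if changed else "unchanged")
```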

Achim answered Oct 29 '22 17:10