
Matching Unicode word boundaries in Python

To match Unicode word boundaries [as defined in Annex #29] in Python, I have been using the regex package with the flags regex.WORD | regex.V1 (regex.UNICODE should be the default, since the pattern is a Unicode string), as follows:

>>> import regex
>>> s = "here are some words"
>>> regex.findall(r'\w(?:\B\S)*', s, flags = regex.V1 | regex.WORD)
['here', 'are', 'some', 'words']

This works well in such simple cases. However, I was wondering what the expected behavior is when the input string contains certain punctuation. As I read it, WB7 says that, for example, the apostrophe in x'z does not qualify as a word boundary, which indeed seems to be the case:

>>> regex.findall(r'\w(?:\B\S)*', "x'z", flags = regex.V1 | regex.WORD)
["x'z"]

However, if the apostrophe is followed by a vowel, the situation changes:

>>> regex.findall(r'\w(?:\B\S)*', "l'avion", flags = regex.V1 | regex.WORD)
["l'", 'avion']

This would suggest that the regex module implements rule WB5a, which is mentioned in the Notes section of the standard. However, this rule also says that the behavior should be the same with \u2019 (right single quotation mark), which I can't reproduce:

>>> regex.findall(r'\w(?:\B\S)*', "l\u2019avion", flags = regex.V1 | regex.WORD)
['l’avion']

Moreover, even with the "normal" apostrophe, a ligature (or y) seems to behave as a "non-vowel":

>>> regex.findall(r'\w(?:\B\S)*', "l'œil", flags = regex.V1 | regex.WORD)
["l'œil"]
>>> regex.findall(r'\w(?:\B\S)*', "J'y suis", flags = regex.V1 | regex.WORD)
["J'y", 'suis']

Is this the expected behavior? (All examples above were executed with regex 2.4.106 and Python 3.5.2.)

ewcz asked Aug 24 '16 20:08



1 Answer

1- RIGHT SINGLE QUOTATION MARK (U+2019) appears to have simply been left out of the source file; the WB5a check only tests for the ASCII apostrophe:

/* Break between apostrophe and vowels (French, Italian). */
/* WB5a */
if (pos_m1 >= 0 && char_at(state->text, pos_m1) == '\'' &&
  is_unicode_vowel(char_at(state->text, text_pos)))
    return TRUE;

2- Unicode vowels are determined by the is_unicode_vowel() function, which lowercases the character and then accepts only the characters in this list:

a, à, á, â, e, è, é, ê, i, ì, í, î, o, ò, ó, ô, u, ù, ú, û

So LATIN SMALL LIGATURE OE (œ) is not considered a Unicode vowel, and neither is y:

Py_LOCAL_INLINE(BOOL) is_unicode_vowel(Py_UCS4 ch) {
#if PY_VERSION_HEX >= 0x03030000
    switch (Py_UNICODE_TOLOWER(ch)) {
#else
    switch (Py_UNICODE_TOLOWER((Py_UNICODE)ch)) {
#endif
    case 'a': case 0xE0: case 0xE1: case 0xE2:
    case 'e': case 0xE8: case 0xE9: case 0xEA:
    case 'i': case 0xEC: case 0xED: case 0xEE:
    case 'o': case 0xF2: case 0xF3: case 0xF4:
    case 'u': case 0xF9: case 0xFA: case 0xFB:
        return TRUE;
    default:
        return FALSE;
    }
}

This bug is now fixed in regex 2016.08.27 after a bug report. [_regex.c:#1668]
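
To make the two points above concrete, here is a minimal Python sketch of the pre-fix logic. It is a port of the quoted C snippets for illustration only, not the regex module's actual code: wb5a_break() and its apostrophes parameter are hypothetical names, and the extended apostrophe set in the last call is only an assumption about what a fix could look like.

```python
# Illustrative port of the quoted C checks; NOT the regex module's real code.

# Code points accepted by the C is_unicode_vowel() switch shown above.
_ACCENTED = (0xE0, 0xE1, 0xE2,   # à á â
             0xE8, 0xE9, 0xEA,   # è é ê
             0xEC, 0xED, 0xEE,   # ì í î
             0xF2, 0xF3, 0xF4,   # ò ó ô
             0xF9, 0xFA, 0xFB)   # ù ú û
VOWELS = frozenset("aeiou") | frozenset(chr(c) for c in _ACCENTED)


def is_unicode_vowel(ch: str) -> bool:
    """Mirror the C switch: lowercase the character, then look it up."""
    return ch.lower() in VOWELS


def wb5a_break(text: str, pos: int, apostrophes: str = "'") -> bool:
    """Return True if a WB5a-style break would be reported just before text[pos].

    `apostrophes` is a hypothetical parameter; the default reproduces the
    pre-fix behavior, which only recognized the ASCII apostrophe.
    """
    return (
        0 < pos < len(text)
        and text[pos - 1] in apostrophes
        and is_unicode_vowel(text[pos])
    )


print(wb5a_break("l'avion", 2))        # True: ASCII apostrophe + vowel
print(wb5a_break("l\u2019avion", 2))   # False: U+2019 is not checked
print(wb5a_break("l'\u0153il", 2))     # False: œ is not in the vowel list
print(wb5a_break("J'y suis", 2))       # False: y is not in the vowel list
print(wb5a_break("l\u2019avion", 2, apostrophes="'\u2019"))  # True with U+2019 added
```

The first four results match the splits observed in the question: only the ASCII apostrophe followed by one of the listed vowels produces a break.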

revo answered Oct 22 '22 21:10
