In order to match Unicode word boundaries [as defined in Unicode Standard Annex #29] in Python, I have been using the regex package with the flags regex.WORD | regex.V1 (regex.UNICODE should be the default, since the pattern is a Unicode string) in the following way:
>>> s="here are some words"
>>> regex.findall(r'\w(?:\B\S)*', s, flags = regex.V1 | regex.WORD)
['here', 'are', 'some', 'words']
It works well in these rather simple cases. However, I was wondering what the expected behavior is when the input string contains certain punctuation. It seems to me that WB7 says that, for example, the apostrophe in x'z does not qualify as a word boundary, which indeed seems to be the case:
>>> regex.findall(r'\w(?:\B\S)*', "x'z", flags = regex.V1 | regex.WORD)
["x'z"]
However, if there is a vowel, the situation changes:
>>> regex.findall(r'\w(?:\B\S)*', "l'avion", flags = regex.V1 | regex.WORD)
["l'", 'avion']
This would suggest that the regex module implements rule WB5a, mentioned in the Notes section of the standard. However, this rule also says that the behavior should be the same with \u2019 (right single quotation mark), which I can't reproduce:
>>> regex.findall(r'\w(?:\B\S)*', "l\u2019avion", flags = regex.V1 | regex.WORD)
['l’avion']
Moreover, even with a "normal" apostrophe, a ligature (or y) seems to behave as a "non-vowel":
>>> regex.findall(r'\w(?:\B\S)*', "l'œil", flags = regex.V1 | regex.WORD)
["l'œil"]
>>> regex.findall(r'\w(?:\B\S)*', "J'y suis", flags = regex.V1 | regex.WORD)
["J'y", 'suis']
Is this the expected behavior? (All examples above were executed with regex 2.4.106 and Python 3.5.2.)
Word boundaries are determined by the current locale if the LOCALE flag is used. Inside a character range, \b represents the backspace character, for compatibility with Python's string literals. \B matches the empty string, but only when it is not at the beginning or end of a word.
The \b metacharacter is an anchor, like the caret (^) and the dollar sign ($). It matches a position called a "word boundary". The match itself is zero-length.
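To make the zero-length nature of \b and \B concrete, here is a small sketch. It uses the standard library re module rather than the third-party regex package, so it only illustrates the default boundary behavior, not the Annex #29 rules discussed in the question.

```python
import re

text = "here are"

# Each \b match is an empty string; only its position carries information.
boundaries = [m.span() for m in re.finditer(r"\b", text)]
print(boundaries)

# \B matches the empty string at every position that is not a word boundary.
non_boundaries = [m.start() for m in re.finditer(r"\B", text)]
print(non_boundaries)

# Inside a character class, \b means the backspace character (U+0008).
backspaces = re.findall(r"[\b]", "a\bc")
print(backspaces)
```

Every span in the first list has equal start and end offsets, which is what "zero-length" means here.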
1- RIGHT SINGLE QUOTATION MARK ’ seems to have simply been missed in the source file:
/* Break between apostrophe and vowels (French, Italian). */
/* WB5a */
if (pos_m1 >= 0 && char_at(state->text, pos_m1) == '\'' &&
  is_unicode_vowel(char_at(state->text, text_pos)))
    return TRUE;
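The effect of testing only for '\'' can be sketched in Python. This is a rough model of the condition in the excerpt, not the module's actual code: the function name is_break_wb5a, the apostrophes parameter, and the plain a/e/i/o/u vowel test are all assumptions made for illustration.

```python
APOSTROPHE = "'"                # U+0027, the only character the excerpt compares against
RIGHT_SINGLE_QUOTE = "\u2019"   # U+2019, which WB5a also names

# Placeholder vowel test for this sketch only (hypothetical, deliberately minimal).
ASCII_VOWELS = frozenset("aeiou")


def is_break_wb5a(text, pos, apostrophes=(APOSTROPHE,)):
    """Return True if a WB5a-style break falls just before text[pos]."""
    if pos < 1 or pos >= len(text):
        return False
    return text[pos - 1] in apostrophes and text[pos].lower() in ASCII_VOWELS


# With only U+0027 recognised, the typographic apostrophe produces no break.
print(is_break_wb5a("l'avion", 2))
print(is_break_wb5a("l\u2019avion", 2))

# Adding U+2019 to the recognised set makes both spellings behave alike.
both = (APOSTROPHE, RIGHT_SINGLE_QUOTE)
print(is_break_wb5a("l\u2019avion", 2, apostrophes=both))
```

Under these assumptions the first call reports a break and the second does not, which matches the difference between "l'avion" and "l\u2019avion" seen in the question.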
2- Unicode vowels are determined by the is_unicode_vowel() function, which amounts to this list:
a, à, á, â, e, è, é, ê, i, ì, í, î, o, ò, ó, ô, u, ù, ú, û
So a LATIN SMALL LIGATURE OE œ character is not considered a Unicode vowel:
Py_LOCAL_INLINE(BOOL) is_unicode_vowel(Py_UCS4 ch) {
#if PY_VERSION_HEX >= 0x03030000
    switch (Py_UNICODE_TOLOWER(ch)) {
#else
    switch (Py_UNICODE_TOLOWER((Py_UNICODE)ch)) {
#endif
    case 'a': case 0xE0: case 0xE1: case 0xE2:
    case 'e': case 0xE8: case 0xE9: case 0xEA:
    case 'i': case 0xEC: case 0xED: case 0xEE:
    case 'o': case 0xF2: case 0xF3: case 0xF4:
    case 'u': case 0xF9: case 0xFA: case 0xFB:
        return TRUE;
    default:
        return FALSE;
    }
}
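For readers who prefer Python, the same lookup can be approximated as follows. This is only a sketch that mirrors the case labels above: the name is_listed_vowel is hypothetical, and str.lower() stands in for Py_UNICODE_TOLOWER, which is an assumption about equivalent behavior for these characters.

```python
# Code points taken from the case labels in the C switch statement.
LISTED_VOWELS = frozenset(
    chr(cp)
    for cp in (
        ord("a"), 0xE0, 0xE1, 0xE2,
        ord("e"), 0xE8, 0xE9, 0xEA,
        ord("i"), 0xEC, 0xED, 0xEE,
        ord("o"), 0xF2, 0xF3, 0xF4,
        ord("u"), 0xF9, 0xFA, 0xFB,
    )
)


def is_listed_vowel(ch):
    """Return True if the lowercased character is one of the listed vowels."""
    return ch.lower() in LISTED_VOWELS


for ch in ("a", "\u00c9", "\u0153", "y"):
    # \u00c9 is É and \u0153 is the ligature œ.
    print(repr(ch), is_listed_vowel(ch))
```

With this list, É counts as a vowel after lowercasing, while œ and y do not, which is consistent with "l'œil" and "J'y" staying unsplit in the question.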
This bug is now fixed in regex 2016.08.27 after a bug report. [_regex.c:#1668]