The Python re module's documentation says that when the re.UNICODE flag is set, '\s' will match:
whatever is classified as space in the Unicode character properties database.
As far as I can tell, the BOM (U+FEFF) is classified as a space.
However:
re.match(u'\s', u'\ufeff', re.UNICODE)
evaluates to None.
Is this a bug in Python or am I missing something?
U+FEFF is not a whitespace character according to the Unicode database.
Wikipedia only lists it as a "related character". Related characters are similar to whitespace characters but don't have the WSpace property in the database.
None of those characters are matched by \s.
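If you want to check this yourself, here is a minimal sketch, assuming Python 3 (where str patterns are already Unicode-aware, so re.UNICODE is redundant but harmless). The names related, spaces and matches_space are only illustrative, and the sample characters are my own pick of a few "related" characters and a few real whitespace characters:

```python
import re
import unicodedata

# A few characters listed as "related" to whitespace (no WSpace property).
related = ["\ufeff", "\u200b", "\u2060"]
# A few characters that do have the WSpace property, for comparison.
spaces = [" ", "\u00a0", "\u2003"]


def matches_space(ch):
    """Return True if the regex class \\s matches the single character ch."""
    return re.match(r"\s", ch, re.UNICODE) is not None


# Print the code point, its general category, and whether \s matches it.
for ch in related + spaces:
    print("U+%04X" % ord(ch), unicodedata.category(ch), matches_space(ch))
```

The "related" characters should report the general category Cf (format) and no match, while the real whitespace characters report Zs (space separator) and a match.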