The Python re module's documentation says that when the re.UNICODE flag is set, '\s' will match:
whatever is classified as space in the Unicode character properties database.
As far as I can tell, the BOM (U+FEFF) is classified as a space.
However:
re.match(u'\\s', u'\ufeff', re.UNICODE)
evaluates to None.
Is this a bug in Python or am I missing something?
U+FEFF is not a whitespace character according to the Unicode database.
Wikipedia lists it only as a "related character". Such characters are similar to whitespace characters but do not have the WSpace (White_Space) property in the database.
None of those characters are matched by \s.
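You can check this yourself. The snippet below is a minimal sketch, assuming Python 3; the helper name `describe` is made up for illustration. The `unicodedata` module does not expose the WSpace property directly, so the sketch uses the general category and `str.isspace()` as stand-ins, and compares U+FEFF with U+00A0 (NO-BREAK SPACE), which is a real whitespace character.

```python
import re
import unicodedata


def describe(ch):
    """Collect a few whitespace-related facts about a single character.

    `describe` is a hypothetical helper, not part of any library.
    """
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        # General category: 'Zs' is a space separator, 'Cf' is a format character.
        "category": unicodedata.category(ch),
        # Python's own notion of whitespace for str objects.
        "isspace": ch.isspace(),
        # Whether \s matches the character under re.UNICODE.
        "matches_s": re.match(r"\s", ch, re.UNICODE) is not None,
    }


bom = describe("\ufeff")   # the BOM / ZERO WIDTH NO-BREAK SPACE
nbsp = describe("\u00a0")  # NO-BREAK SPACE, for comparison

print(bom)
print(nbsp)
```

U+FEFF comes back with category `Cf` (a format character) and is not matched by `\s`, while U+00A0 has category `Zs` and is matched. If you just want to get rid of a leading BOM, decoding the input with the `utf-8-sig` codec is one option.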