Python regex '\s' does not match unicode BOM (U+FEFF)

The Python re module's documentation says that when the re.UNICODE flag is set, '\s' will match:

whatever is classified as space in the Unicode character properties database.

As far as I can tell, the BOM (U+FEFF) is classified as a space.

However:

re.match(u'\s', u'\ufeff', re.UNICODE)

evaluates to None.

Is this a bug in Python or am I missing something?

user2771609 asked Sep 10 '15 16:09



1 Answer

U+FEFF is not a whitespace character according to the Unicode database. Its general category is Cf (Format), and it does not have the White_Space property.
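A quick way to check this yourself is to ask Python's own copy of the Unicode database. This is only a minimal sketch, assuming Python 3 (where str patterns are Unicode-aware by default, so re.UNICODE is not needed); the variable names are just for illustration.

```python
import re
import unicodedata

bom = "\ufeff"

# Look up how the Unicode database bundled with Python classifies U+FEFF.
category = unicodedata.category(bom)   # general category, e.g. 'Zs' would mean "space separator"
name = unicodedata.name(bom)           # official character name
is_space = bom.isspace()               # what str.isspace() thinks
match = re.match(r"\s", bom)           # what the regex engine thinks

print(category, name, is_space, match)
```

On a current Python 3 this should report the category as Cf rather than Zs, which is consistent with \s not matching.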

Wikipedia lists it only as a "related character". These characters are similar to whitespace but don't have the WSpace (White_Space) property in the database.

None of those characters are matched by \s.
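If you do want a pattern that treats the BOM like whitespace, one option is to add it to a character class explicitly. The following is a rough sketch, assuming Python 3; the list of "related" characters is just a small selection of zero-width code points picked for illustration, and SPACE_OR_BOM is a hypothetical name, not something from the re module.

```python
import re

# A few zero-width "related characters" (illustrative selection, not exhaustive):
# ZERO WIDTH SPACE, ZWNJ, ZWJ, WORD JOINER, and the BOM itself.
related = ["\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"]

# Collect whichever of them \s matches on its own.
matched = [c for c in related if re.match(r"\s", c)]

# Hypothetical workaround: list U+FEFF explicitly alongside \s.
SPACE_OR_BOM = re.compile(r"[\s\ufeff]")
bom_match = SPACE_OR_BOM.match("\ufeff")
space_match = SPACE_OR_BOM.match(" ")

print(matched, bom_match is not None, space_match is not None)
```

Whether this is the right fix depends on where the BOM comes from; if it is only a leading byte order mark from a file, decoding with the utf-8-sig codec removes it before any regex is involved.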

Stefan answered Oct 22 '22 07:10
