Python regex '\s' does not match unicode BOM (U+FEFF)

Question

The Python re module's documentation says that when the re.UNICODE flag is set, '\s' will match:

whatever is classified as space in the Unicode character properties database.

As far as I can tell, the BOM (U+FEFF) is classified as a space.

However:

re.match(u'\s', u'\ufeff', re.UNICODE)

evaluates to None.

Is this a bug in Python or am I missing something?

Stefan · Accepted Answer

U+FEFF is not a whitespace character according to the Unicode database.

Wikipedia lists it only as a "related character". Such characters are similar to whitespace characters but do not have the WSpace (White_Space) property in the database.

None of those characters are matched by \s.
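
This can be checked from Python itself. The following is a minimal sketch, assuming Python 3.3 or later (where the u'' prefix is still accepted); the helper name describe is made up for illustration. It looks up each character's name and general category with the standard unicodedata module and reports whether \s matches it, then shows one possible workaround: listing U+FEFF explicitly in a character class.

```python
import re
import unicodedata


def describe(ch):
    # Return (Unicode name, general category, does the \s class match?).
    # "describe" is a hypothetical helper, not part of any library.
    matched = re.match(r'\s', ch, re.UNICODE) is not None
    return unicodedata.name(ch), unicodedata.category(ch), matched


# U+FEFF is a format character (category Cf), so \s does not match it.
bom = describe(u'\ufeff')

# U+00A0 NO-BREAK SPACE is a space separator (category Zs), so \s matches it.
nbsp = describe(u'\u00a0')

print(bom)
print(nbsp)

# Workaround sketch: add U+FEFF to the character class explicitly.
space_or_bom = re.compile(u'[\\s\ufeff]', re.UNICODE)
bom_matched = space_or_bom.match(u'\ufeff') is not None
print(bom_matched)
```

If the goal is only to drop a BOM at the start of a file, decoding with the 'utf-8-sig' codec may be simpler than a regex.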

Tags: python, regex, unicode

user2771609
