Why is the same character encoding to different bytes in different parts of my code base?
I have a unit test that generates a temporary file tree and then checks to make sure my scan actually finds the file in question.
def test_unicode_file_name():
    test_regex = "é"
    file_tree = {"files": ["é"]}  # File created with python.open()
    with TempTree(file_tree) as tmp_tree:
        import pdb; pdb.set_trace()
        result = tasks.find_files(test_regex, root_path=tmp_tree.root_path)
        expected = [os.path.join(tmp_tree.root_path, "é")]
        assert result == expected
The relevant part of the scan:
for dir_entry in scandir(current_path):
    if dir_entry.is_dir():
        dirs_to_search.append(dir_entry.path)
    if dir_entry.is_file():
        testing = dir_entry.name
        if filename_regex.match(testing):
            results.append(dir_entry.path)
When I started digging into things, I found that the test character (copied from my unit test) and the character in dir_entry.name encoded to different bytes.
(Pdb) testing
'é'
(Pdb) 'é'
'é'
(Pdb) testing == 'é'
False
(Pdb) testing in 'é'
False
(Pdb) type(testing)
<class 'str'>
(Pdb) type('é')
<class 'str'>
(Pdb) repr(testing)
"'é'"
(Pdb) repr('é')
"'é'"
(Pdb) 'é'.encode("utf-8")
b'\xc3\xa9'
(Pdb) testing.encode("utf-8")
b'e\xcc\x81'
There are three different Unicode character encodings: UTF-8, UTF-16 and UTF-32. Of these three, only UTF-8 should be used for Web content.
Each element of a str is one Unicode code point. Because Unicode defines far more than 256 code points, encodings such as UTF-8 need more than one byte to represent many of them. A bytes object gives you access to those underlying bytes.
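As a rough illustration of the point about bytes per character, the following sketch encodes a single code point in the three encodings mentioned above. The variable names are only for illustration, and the explicit little-endian codec names are used so that no byte-order mark is counted.

```python
# One code point: LATIN SMALL LETTER E WITH ACUTE, written as an escape
# so the example does not depend on how this file itself is normalized.
composed = "\u00e9"

# Number of bytes the same single code point needs in each encoding.
byte_counts = {
    encoding: len(composed.encode(encoding))
    for encoding in ("utf-8", "utf-16-le", "utf-32-le")
}

print(len(composed))   # length in code points
print(byte_counts)     # length in bytes, per encoding
```

The str has length 1 in every case; only the encoded byte length changes.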
As a content author or developer, you should nowadays always choose the UTF-8 character encoding for your content or data. This Unicode encoding is a good choice because you can use a single character encoding to handle any character you are likely to need.
Your operating system (macOS, at a guess) has converted the filename 'é' to Unicode Normalization Form D (NFD), decomposing it into an unaccented 'e' followed by a combining acute accent. You can see this clearly with a quick session in the Python interpreter:
>>> import unicodedata
>>> e1 = b'\xc3\xa9'.decode()
>>> e2 = b'e\xcc\x81'.decode()
>>> [unicodedata.name(c) for c in e1]
['LATIN SMALL LETTER E WITH ACUTE']
>>> [unicodedata.name(c) for c in e2]
['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
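A small sketch of how the two forms relate, assuming the same two byte sequences seen in the pdb session: the strings compare unequal as they are, but become equal once both are brought to the same normalization form with unicodedata.normalize.

```python
import unicodedata

composed = b"\xc3\xa9".decode("utf-8")      # one code point (NFC form)
decomposed = b"e\xcc\x81".decode("utf-8")   # 'e' + combining accent (NFD form)

# They render identically but are different sequences of code points.
print(composed == decomposed)

# Normalizing either one to the other's form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)
```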
To ensure that you're comparing like with like, you can convert the filename given by dir_entry.name back to Normalization Form C (NFC) before testing it against your regex:
import unicodedata

for dir_entry in scandir(current_path):
    if dir_entry.is_dir():
        dirs_to_search.append(dir_entry.path)
    if dir_entry.is_file():
        testing = unicodedata.normalize('NFC', dir_entry.name)
        if filename_regex.match(testing):
            results.append(dir_entry.path)
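For completeness, here is a self-contained sketch of the same idea. The names nfc and find_files_nfc are hypothetical stand-ins, not the real tasks.find_files from the question, whose full body is not shown. The sketch assumes it is also safe to normalize the pattern itself, which holds for a simple literal pattern like the one in the test.

```python
import os
import re
import unicodedata


def nfc(text):
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def find_files_nfc(pattern, root_path):
    """Hypothetical stand-in for tasks.find_files: walk root_path and
    return paths whose NFC-normalized file name matches pattern."""
    filename_regex = re.compile(nfc(pattern))  # normalize the pattern too
    results = []
    dirs_to_search = [root_path]
    while dirs_to_search:
        current_path = dirs_to_search.pop()
        with os.scandir(current_path) as entries:
            for dir_entry in entries:
                if dir_entry.is_dir(follow_symlinks=False):
                    dirs_to_search.append(dir_entry.path)
                elif dir_entry.is_file():
                    if filename_regex.match(nfc(dir_entry.name)):
                        # The path is returned as the OS reports it,
                        # so it may still be in decomposed form.
                        results.append(dir_entry.path)
    return results
```

Note that the returned paths are not normalized, so an assertion like result == expected in the unit test may still fail if the file system reports decomposed names. Comparing nfc(path) on both sides is one way around that.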