Why is 'é' and 'é' encoding to different bytes?

Question

Why is the same character encoding to different bytes in different parts of my code base?

Context

I have a unit test that generates a temporary file tree and then checks to make sure my scan actually finds the file in question.

def test_unicode_file_name():
    test_regex = "é"
    file_tree = {"files": ["é"]}  # File created with Python's built-in open()
    with TempTree(file_tree) as tmp_tree:
        import pdb; pdb.set_trace()
        result = tasks.find_files(test_regex, root_path=tmp_tree.root_path)
        expected = [os.path.join(tmp_tree.root_path, "é")]
        assert result == expected

Function that's failing

for dir_entry in scandir(current_path):
    if dir_entry.is_dir():
        dirs_to_search.append(dir_entry.path)

    if dir_entry.is_file():
        testing = dir_entry.name
        if filename_regex.match(testing):
            results.append(dir_entry.path)

PDB Session

When I started digging into things, I found that the test character (copied from my unit test) and the character in dir_entry.name encode to different bytes.

(Pdb) testing
'é'
(Pdb) 'é'
'é'
(Pdb) testing == 'é'
False
(Pdb) testing in 'é'
False
(Pdb) type(testing)
<class 'str'>
(Pdb) type('é')
<class 'str'>
(Pdb) repr(testing)
"'é'"
(Pdb) repr('é')
"'é'"
(Pdb) 'é'.encode("utf-8")
b'\xc3\xa9'
(Pdb) testing.encode("utf-8")
b'e\xcc\x81'

Answer

Your operating system (macOS, at a guess) has converted the filename 'é' to Unicode Normalization Form D (NFD), decomposing it into an unaccented 'e' followed by a combining acute accent. You can see this clearly with a quick session in the Python interpreter:

>>> import unicodedata
>>> e1 = b'\xc3\xa9'.decode()
>>> e2 = b'e\xcc\x81'.decode()
>>> [unicodedata.name(c) for c in e1]
['LATIN SMALL LETTER E WITH ACUTE']
>>> [unicodedata.name(c) for c in e2]
['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
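
As a rough check that these really are the same character in two normalization forms, you can convert each spelling to the other form and compare. The variable names composed and decomposed are only illustrative and do not come from your code:

```python
import unicodedata

composed = b'\xc3\xa9'.decode('utf-8')      # U+00E9, a single code point
decomposed = b'e\xcc\x81'.decode('utf-8')   # U+0065 followed by U+0301

# The raw strings differ in both length and content.
print(len(composed), len(decomposed))
print(composed == decomposed)

# Normalizing either one to the other's form makes them compare equal.
print(unicodedata.normalize('NFD', composed) == decomposed)
print(unicodedata.normalize('NFC', decomposed) == composed)
```

The strings only compare equal once both sides are in the same form, which is why the == and in checks in your PDB session returned False.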

To ensure that you're comparing like with like, you can convert the filename given by dir_entry.name back to Normalization Form C (NFC) before testing it against your regex:

import unicodedata

for dir_entry in scandir(current_path):
    if dir_entry.is_dir():
        dirs_to_search.append(dir_entry.path)

    if dir_entry.is_file():
        testing = unicodedata.normalize('NFC', dir_entry.name)
        if filename_regex.match(testing):
            results.append(dir_entry.path)
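
If it helps to try this outside your project, here is a minimal self-contained sketch of the same idea. The function name find_files_nfc and its signature are hypothetical stand-ins for your tasks.find_files, which I haven't seen, and the sketch assumes the pattern is a plain literal that NFC normalization won't change the meaning of. It normalizes the pattern as well as each filename, in case the pattern itself arrives in decomposed form:

```python
import os
import re
import tempfile
import unicodedata


def find_files_nfc(pattern, root_path):
    """Return paths under root_path whose NFC-normalized name matches pattern."""
    # Normalize the pattern too, in case it was typed or saved in NFD.
    regex = re.compile(unicodedata.normalize('NFC', pattern))
    results = []
    dirs_to_search = [root_path]
    while dirs_to_search:
        current_path = dirs_to_search.pop()
        for dir_entry in os.scandir(current_path):
            if dir_entry.is_dir():
                dirs_to_search.append(dir_entry.path)
            elif dir_entry.is_file():
                name = unicodedata.normalize('NFC', dir_entry.name)
                if regex.match(name):
                    # Keep the path exactly as the OS reported it.
                    results.append(dir_entry.path)
    return results


# Create a file whose name is deliberately in decomposed (NFD) form.
with tempfile.TemporaryDirectory() as root:
    nfd_name = unicodedata.normalize('NFD', '\u00e9')
    open(os.path.join(root, nfd_name), 'w').close()
    found = find_files_nfc('\u00e9', root)
    print(len(found))
```

Note that the returned paths are still whatever the operating system reported, so a test that compares them against a path built from a literal 'é' would need to normalize both sides before asserting equality.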

Tags: python, python-3.x, unicode, normalization

Asked by AlexLordThorsen. Answered by Zero Piraeus.