Why is 'é' and 'é' encoding to different bytes?

Question

Why is the same character encoding to different bytes in different parts of my code base?

Context

I have a unit test that generates a temporary file tree and then checks to make sure my scan actually finds the file in question.

def test_unicode_file_name():
    test_regex = "é"
    file_tree = {"files": ["é"]} # File created with python.open()
    with TempTree(file_tree) as tmp_tree:
        import pdb; pdb.set_trace()
        result = tasks.find_files(test_regex, root_path=tmp_tree.root_path)
        expected = [os.path.join(tmp_tree.root_path, "é")]
        assert result == expected

Function that's failing

for dir_entry in scandir(current_path):
    if dir_entry.is_dir():
        dirs_to_search.append(dir_entry.path)

    if dir_entry.is_file():
        testing = dir_entry.name
        if filename_regex.match(testing):
            results.append(dir_entry.path)

PDB Session

When I started digging into things I found that the test character (copied from my unit test) and the character in dir_entry.name encoded to different bytes.

(Pdb) testing
'é'
(Pdb) 'é'
'é'
(Pdb) testing == 'é'
False
(Pdb) testing in 'é'
False
(Pdb) type(testing)
<class 'str'>
(Pdb) type('é')
<class 'str'>
(Pdb) repr(testing)
"'é'"
(Pdb) repr('é')
"'é'"
(Pdb) 'é'.encode("utf-8")
b'\xc3\xa9'
(Pdb) testing.encode("utf-8")
b'e\xcc\x81'
AlexLordThorsen asked Sep 20 '16 00:09


1 Answer

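Your operating system (macOS, at a guess) has converted the filename 'é' to Unicode Normalization Form D (NFD), decomposing it into an unaccented 'e' followed by a combining acute accent. The two strings render identically but contain different code points, which is why they compare unequal and encode to different bytes. You can see this clearly with a quick session in the Python interpreter: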

>>> import unicodedata
>>> e1 = b'\xc3\xa9'.decode()
>>> e2 = b'e\xcc\x81'.decode()
>>> [unicodedata.name(c) for c in e1]
['LATIN SMALL LETTER E WITH ACUTE']
>>> [unicodedata.name(c) for c in e2]
['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
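As a small follow-on sketch (the variable names `composed` and `decomposed` are only illustrative), `unicodedata.normalize` converts between the two forms, so the strings compare equal once both are in the same form:

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (what NFC produces)
decomposed = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT (what NFD produces)

# The raw strings differ in both content and length.
assert composed != decomposed
assert (len(composed), len(decomposed)) == (1, 2)

# Normalized to a common form, they compare equal.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```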

To ensure that you're comparing like with like, you can convert the filename given by dir_entry.name back to Normalization Form C (NFC), the precomposed form used by the literal in your test, before testing it against your regex:

import unicodedata

for dir_entry in scandir(current_path):
    if dir_entry.is_dir():
        dirs_to_search.append(dir_entry.path)

    if dir_entry.is_file():
        # Recompose the name (e.g. 'e' + combining accent -> 'é') before matching.
        testing = unicodedata.normalize('NFC', dir_entry.name)
        if filename_regex.match(testing):
            results.append(dir_entry.path)
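If the pattern itself might arrive in decomposed form, one option is to normalize both sides. The following is a minimal, self-contained sketch rather than your actual `tasks.find_files`: the function name `find_files_normalized` and its `form` parameter are hypothetical, it assumes `os.scandir` (Python 3.6+ for the `with` form), and normalizing the pattern string is only reasonable for simple patterns, since it treats the regex source as ordinary text.

```python
import os
import re
import unicodedata


def find_files_normalized(pattern, root_path, form="NFC"):
    """Return paths under root_path whose file name matches pattern.

    Hypothetical helper: both the pattern and each file name are
    normalized to the same Unicode form before matching.
    """
    regex = re.compile(unicodedata.normalize(form, pattern))
    results = []
    dirs_to_search = [root_path]
    while dirs_to_search:
        current_path = dirs_to_search.pop()
        with os.scandir(current_path) as entries:
            for dir_entry in entries:
                if dir_entry.is_dir(follow_symlinks=False):
                    dirs_to_search.append(dir_entry.path)
                elif dir_entry.is_file():
                    # Compare in a common form; keep the on-disk path as is.
                    name = unicodedata.normalize(form, dir_entry.name)
                    if regex.match(name):
                        results.append(dir_entry.path)
    return results
```

The returned paths are left exactly as the filesystem reports them, so a test that compares against a hand-built expected path would still need to normalize both sides before asserting equality.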
Zero Piraeus answered Sep 21 '22 17:09