
Why is there no byte-order issue with UTF-8 strings?

This question is closely related to this one, but I'm going to formulate it quite differently, since I cannot edit the one mentioned.

There is a claim that the BOM is redundant in UTF-8 encoded strings, since UTF-8 is a "byte-oriented" encoding: the smallest code unit is a byte, and you can always tell from the most significant bits of a byte whether it represents a character by itself or is only part of a character's representation. The Google JavaScript style guide requires files to be saved in UTF-8 without a BOM, and Jukka Korpela's "Unicode Explained" states:

In UTF-8, there is no byte order issue, since the code unit size is one octet. Therefore, using BOM serves no purpose.
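As an illustration of the quoted claim, here is a minimal Python sketch (mine, not from the book or the style guide): the UTF-8 encoding of an ASCII-only string is a single fixed byte sequence, so there is no second layout that a BOM would be needed to distinguish.

```python
# Minimal sketch: the UTF-8 encoding of an ASCII string is one fixed
# sequence of bytes; there is no "little-endian UTF-8" variant to
# distinguish, so a BOM is not needed to signal byte order.
data = "abcdefgh".encode("utf-8")
print(data.hex(" "))  # 61 62 63 64 65 66 67 68 -- the same on every platform
```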

Suppose there is a UTF-8 string consisting only of ASCII characters, say "abcdefgh". If I stored it on a machine with a different endianness (one that uses a 32-bit word), wouldn't it be changed to "dcbahgfe", since one character here is one byte, and the bytes are stored in the opposite order on a machine with the opposite endianness?

If this is not the case and the order of bytes is always the same in memory, differing only within a word (during processing, so to speak), then why is byte order important for the UTF-16 encoding? That is, if I know that the encoding is UTF-16 and I address byte 15, I know it is the first byte of the 8th code unit in the string, and I need to get the second byte in order to find the character, or the part of a surrogate pair, represented by this code unit.
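To make the UTF-16 part of this concrete, here is a small Python sketch (my own, purely for illustration): the two bytes of a code unit only become a number once you decide which of them is the most significant, and that decision is exactly what the byte order specifies.

```python
# The two bytes of one UTF-16 code unit, e.g. for the character 'a' (U+0061).
pair = bytes([0x00, 0x61])

# The same two bytes yield different code units depending on the assumed order.
print(hex(int.from_bytes(pair, "big")))     # 0x61   -> 'a' if the stream is UTF-16-BE
print(hex(int.from_bytes(pair, "little")))  # 0x6100 -> U+6100, a different character, if UTF-16-LE
```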

So could anyone tell me where I'm wrong? I'm pretty sure I have some misconceptions about endianness and/or computer hardware, so I would be very grateful if anyone could explain this or point me to further information.


UPDATE:

So, if there is a UTF-16 string, let's say abcdefgh, it can be stored in memory either as a0b0c0d0e0f0g0h0 or as 0a0b0c0d0e0f0g0h (with every pair of bytes swapped; by the way, I don't understand that either: why two and not four). And if this string is read on a machine with the opposite endianness, even one byte at a time, the bytes still need to be swapped.
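For what it's worth, the two layouts described here can be reproduced with Python's codecs (a sketch of my own); the pairs swap rather than groups of four because a UTF-16 code unit is two bytes, and the byte order applies per code unit.

```python
s = "abcdefgh"
# UTF-16 code units are 16 bits (two bytes), so endianness is applied per
# two-byte unit -- that is why pairs swap, not groups of four.
print(s.encode("utf-16-le").hex(" "))  # 61 00 62 00 63 00 ... 68 00  ("a0b0c0d0...")
print(s.encode("utf-16-be").hex(" "))  # 00 61 00 62 00 63 ... 00 68  ("0a0b0c0d...")
```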

Now, if there is the same string abcdefgh in UTF-8, it is also stored as a sequence of bytes. The question is: why don't the bytes get swapped in this case? Or if they do, why doesn't one need to swap them back when reading? Because, as far as I understand it, to the hardware and software at this level there is no difference between the encodings; it is just a sequence of bytes either way. So how is it that the bytes in UTF-16 get swapped while the bytes in UTF-8 don't?

I'm using abcdefgh on purpose, to show that there could seemingly be an issue even with these simple characters, which take one byte to encode (I know I'm wrong about this, but I cannot understand why). AFAIK, in UTF-8 one can always tell a, b, c, etc. from other characters by looking at the most significant bits of the byte. That is, if I address byte 13 (counting from 1) and it is 01100001, it is definitely the character a. It is not known how many characters precede it in the string, but the fact that this is an a, and not part of some other character's encoding, is clear. Now suppose I read 4 bytes at a time and their values are a, b, c, d. How do I know the intended order?
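The "most significant bits" rule mentioned above can be sketched in Python like this (my own illustration, not from any of the cited sources):

```python
def classify_utf8_byte(b: int) -> str:
    """Classify a single UTF-8 byte by its most significant bits."""
    if b & 0b1000_0000 == 0:
        return "single-byte character (ASCII)"   # 0xxxxxxx
    if b & 0b1100_0000 == 0b1000_0000:
        return "continuation byte"               # 10xxxxxx
    if b & 0b1110_0000 == 0b1100_0000:
        return "lead byte of a 2-byte sequence"  # 110xxxxx
    if b & 0b1111_0000 == 0b1110_0000:
        return "lead byte of a 3-byte sequence"  # 1110xxxx
    if b & 0b1111_1000 == 0b1111_0000:
        return "lead byte of a 4-byte sequence"  # 11110xxx
    return "invalid in UTF-8"

print(classify_utf8_byte(0b0110_0001))  # 0x61, 'a' -> single-byte character (ASCII)
```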

asked Dec 24 '22 by Dmitry Koroliov

2 Answers

You have to realize that the endianness of the machine processing UTF-8 or UTF-16 simply doesn't matter to answer the question of why there are no byte order issues with UTF-8. All that matters is that UTF-8 and UTF-16 are byte streams. UTF-8 is based on 8-bit code units, so there's only a single way to format the byte stream: simply put one byte after the other. UTF-16, on the other hand, is based on 16-bit code units. There are two ways to encode a 16-bit value in a byte stream: most significant byte first (big endian) or least significant byte first (little endian). That's why there are two flavors of UTF-16 byte streams, typically called UTF-16-BE and UTF-16-LE.
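A small Python sketch of the point above (my own addition, not part of the original answer): one byte-stream format for UTF-8, two for UTF-16, even for a character whose UTF-8 form spans several bytes.

```python
ch = "\u20ac"  # the euro sign

# UTF-8: one byte-stream format; the three bytes always appear in this order.
print(ch.encode("utf-8").hex(" "))      # e2 82 ac

# UTF-16: two possible byte-stream formats for the same 16-bit code unit.
print(ch.encode("utf-16-be").hex(" "))  # 20 ac
print(ch.encode("utf-16-le").hex(" "))  # ac 20

# The generic "utf-16" codec prepends a BOM precisely so a reader can tell
# which of the two layouts follows (ff fe or fe ff, depending on the machine).
print(ch.encode("utf-16").hex(" "))
```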

How an actual computer addresses, reads, and writes memory when processing UTF-8 is a completely unrelated question. A computer might use a weird addressing scheme that complicates UTF-8 processing, requiring byte swaps or whatever. So there might be byte order issues related to a specific architecture, but these aren't byte order issues concerning the specification of UTF-8. An implementation can be sure that there's only one way a UTF-8 byte stream can be formatted.

answered Dec 31 '22 by nwellnhof


32-bit word -> "dcbahgfe": You could view it that way, but most processors can access memory in octets (the term is: memory is byte-addressable). So, if you have a packed data structure that is a sequence of bytes, the bytes will have sequential addresses.

If you read and write whole words and view them as larger integers, then you would have to pack the bytes in a specific order, but that's not an endianness issue; it's an arithmetic one at that level.
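To illustrate this point (a sketch of my own, not from the answer): the same four bytes sit at the same sequential addresses either way; only the integer you compute from them changes with the assumed byte order.

```python
import struct

data = b"abcd"  # four bytes at consecutive addresses: 61 62 63 64

# Interpreting those bytes as one 32-bit integer depends on the assumed order,
# but the bytes themselves are not rearranged in memory.
print(hex(struct.unpack("<I", data)[0]))  # 0x64636261  (little-endian view)
print(hex(struct.unpack(">I", data)[0]))  # 0x61626364  (big-endian view)
```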


As far as alignment goes, it is up to compilers and heap libraries. Many will pad between structures so that each begins on an efficient address boundary.

answered Dec 31 '22 by Tom Blodget