Isn’t UTF-8’s byte order different on big-endian machines than on little-endian machines? So why doesn’t UTF-8 require a BOM?

Tags:

unicode

utf-8

UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order.
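To make the quoted claim concrete: the UTF-8 BOM is just the code point U+FEFF encoded in UTF-8, and that encoding is the fixed byte sequence EF BB BF on every machine. A minimal C sketch of my own (not from the answer) that prints it:

    #include <stdio.h>

    int main(void) {
        /* U+FEFF encoded in UTF-8: always these three bytes, in this
         * order, regardless of the machine's endianness. */
        const unsigned char bom[] = {0xEF, 0xBB, 0xBF};
        for (size_t i = 0; i < sizeof bom; i++)
            printf("%02X ", bom[i]);   /* prints: EF BB BF */
        printf("\n");
        return 0;
    }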

If UTF-8 stored all code points in a single byte, it would make sense why endianness doesn’t play any role and thus why a BOM isn’t required. But since code points 128 and above are stored using 2, 3, or up to 4 bytes, doesn’t their byte order on big-endian machines differ from little-endian machines? How can we then claim UTF-8 always has the same byte order?

Thank you

EDIT:

UTF-8 is byte oriented

I understand that if a two-byte UTF-8 character C consists of bytes B1 and B2 (where B1 is the first byte and B2 the last byte), then with UTF-8 those two bytes are always written in the same order (thus if this character is written to a file on little-endian machine LEM, B1 will be first and B2 last; similarly, if C is written to a file on big-endian machine BEM, B1 will still be first and B2 still last).

But what happens when C is written to file F on LEM, and we copy F to BEM and try to read it there? Since BEM automatically swaps bytes (B1 is now the last and B2 the first byte), how will an app (running on BEM) reading F know whether F was created on BEM, and thus the order of the two bytes wasn’t swapped, or whether F was transferred from LEM, in which case BEM automatically swapped the bytes?

I hope the question makes some sense.

EDIT 2:

In response to your edit: big-endian machines do not swap bytes if you ask them to read a byte at a time.

a) Oh, so even though character C is 2 bytes long, an app (residing on BEM) reading F will read just one byte at a time into memory (thus it will first read B1 into memory and only then B2)?
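As I understand it, that is exactly what happens with standard byte-at-a-time I/O in C, where endianness never enters the picture (my own sketch, with a hypothetical file name):

    #include <stdio.h>

    int main(void) {
        /* fgetc() returns bytes strictly in file order. The CPU's
         * endianness only affects multi-byte loads, never single-byte
         * reads, so B1 comes back before B2 on LEM and BEM alike. */
        FILE *f = fopen("F.txt", "rb");   /* hypothetical file name */
        if (!f) return 1;

        int b;
        while ((b = fgetc(f)) != EOF)
            printf("%02X ", (unsigned char)b);
        printf("\n");

        fclose(f);
        return 0;
    }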

b)

In UTF-8, you decide what to do with a byte based on its high-order bits

Assuming file F has two consecutive characters C and C1 (where C consists of bytes B1 and B2, while C1 has bytes B3, B4 and B5), how will an app reading F know which bytes belong together simply by checking each byte's high-order bits (for example, how will it figure out that B1 and B2 taken together should represent a character, and not B1, B2 and B3)?
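For reference, this is the standard UTF-8 lead-byte scheme (general background, not something stated in this thread): the count of leading 1-bits in the first byte gives the sequence length, and every continuation byte starts with the bits 10. A sketch:

    #include <stdio.h>

    /* Length of a UTF-8 sequence from its lead byte, or 0 for a
     * continuation byte (10xxxxxx) or an invalid lead byte. */
    static int utf8_len(unsigned char b) {
        if (b < 0x80)           return 1; /* 0xxxxxxx: ASCII           */
        if ((b & 0xE0) == 0xC0) return 2; /* 110xxxxx: 2-byte sequence */
        if ((b & 0xF0) == 0xE0) return 3; /* 1110xxxx: 3-byte sequence */
        if ((b & 0xF8) == 0xF0) return 4; /* 11110xxx: 4-byte sequence */
        return 0;                         /* 10xxxxxx: continuation    */
    }

    int main(void) {
        /* "é" is C3 A9 (2 bytes) and "€" is E2 82 AC (3 bytes), so a
         * decoder knows B1+B2 form one character and B3..B5 the next. */
        const unsigned char s[] = {0xC3, 0xA9, 0xE2, 0x82, 0xAC};
        for (size_t i = 0; i < sizeof s; i += utf8_len(s[i]))
            printf("sequence of %d byte(s) at offset %zu\n",
                   utf8_len(s[i]), i);
        return 0;
    }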

If you believe that you're seeing something different, please edit your question and include

I’m not saying that; I simply didn’t understand what was going on.

c) Why aren't UTF-16 and UTF-32 also byte oriented?

asked Sep 30 '10 by user437291



2 Answers

The byte order is different on big endian vs little endian machines for words/integers larger than a byte.

e.g. on a big-endian machine, a short integer of 2 bytes stores the 8 most significant bits in the first byte and the 8 least significant bits in the second byte. On a little-endian machine, the 8 most significant bits will be in the second byte and the 8 least significant bits in the first byte.

So, if you write the memory content of such a short int directly to a file/network, the byte ordering within the short int will be different depending on the endianness.
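A small sketch of that (my own; dump a 16-bit integer's raw memory, and the two architectures print different orders):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        uint16_t x = 0x1234;
        unsigned char bytes[sizeof x];
        memcpy(bytes, &x, sizeof x);  /* raw memory, as a direct write
                                         to a file/network would see it */
        /* Big-endian machine:    prints 12 34
         * Little-endian machine: prints 34 12 */
        printf("%02X %02X\n", bytes[0], bytes[1]);
        return 0;
    }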

UTF-8 is byte oriented, so there's no issue regarding endianness: the first byte is always the first byte, the second byte is always the second byte, and so on, regardless of endianness.
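By contrast, a UTF-8 string is already a sequence of bytes, so writing it out involves no value wider than a byte (a sketch; the file name is hypothetical):

    #include <stdio.h>

    int main(void) {
        /* U+00E9 ("é") in UTF-8 is the byte sequence C3 A9. Written
         * byte by byte, the file is identical on any machine, because
         * the encoding itself fixes the order. */
        const unsigned char utf8[] = {0xC3, 0xA9};
        FILE *f = fopen("out.txt", "wb");   /* hypothetical file name */
        if (!f) return 1;
        fwrite(utf8, 1, sizeof utf8, f);
        fclose(f);
        return 0;
    }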

answered Sep 28 '22 by nos


To answer c): UTF-16 and UTF-32 represent characters as 16-bit or 32-bit words, so they are not byte-oriented.

For UTF-8, the smallest unit is a byte, thus it is byte-oriented. The algorithm reads or writes one byte at a time. A byte is represented the same way on all machines.

For UTF-16, the smallest unit is a 16-bit word, and for UTF-32, the smallest unit is a 32-bit word. The algorithm reads or writes one word at a time (2 bytes, or 4 bytes). The order of the bytes in each word is different on big-endian and little-endian machines.
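To illustrate with U+20AC ("€"), which UTF-16 encodes as the single 16-bit code unit 0x20AC (my own sketch, same raw-memory dump as above):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        /* One UTF-16 code unit. Its two bytes land in memory in CPU
         * order: 20 AC on big-endian (UTF-16BE), AC 20 on little-endian
         * (UTF-16LE) -- exactly the ambiguity the BOM resolves. */
        uint16_t unit = 0x20AC;
        unsigned char bytes[sizeof unit];
        memcpy(bytes, &unit, sizeof unit);
        printf("%02X %02X\n", bytes[0], bytes[1]);
        return 0;
    }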

answered Sep 28 '22 by Chad