
In UTF-16, UTF-16BE, and UTF-16LE, is the endianness of UTF-16 the computer's endianness?

UTF-16 is a character encoding that uses two-byte code units. Swapping the order of those two bytes produces UTF-16BE or UTF-16LE.

But I find that an encoding named just UTF-16 exists in the Ubuntu gedit text editor, alongside UTF-16BE and UTF-16LE. With a C test program I found that my computer is little endian, and I confirmed that its UTF-16 is the same encoding as UTF-16LE.
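
(A minimal sketch of such a test -- it checks which byte of an integer is stored first:)

    #include <stdio.h>

    int main(void) {
        unsigned int value = 1;
        unsigned char *bytes = (unsigned char *)&value;

        /* On a little-endian machine the least significant byte sits at the
           lowest address, so bytes[0] is 1; on a big-endian machine it is 0. */
        printf("%s endian\n", bytes[0] == 1 ? "little" : "big");
        return 0;
    }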

Also: a multi-byte value (such as an integer) has two possible byte orders, depending on whether the computer is little or big endian. Little-endian computers produce little-endian values in hardware (except for values serialized by Java, which always uses big-endian byte order).

Since text can be saved as UTF-16LE as well as UTF-16BE on my little-endian computer, are characters produced one byte at a time (like an ASCII string; see [3]), and is the endianness of UTF-16 just defined by humans -- not a result of big-endian machines writing big-endian UTF-16 while little-endian machines write little-endian UTF-16?
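
(For illustration, a small C sketch of the contrast: an ASCII string is a sequence of single bytes with no byte order, while UTF-16 code units written straight from memory follow the machine's order:)

    #include <stdio.h>

    int main(void) {
        /* ASCII: one byte per character, written in order, so no
           endianness is involved: "AB" is 41 42 on every machine. */
        const char ascii[] = "AB";

        /* UTF-16: two bytes per code unit ('A' = U+0041, 'B' = U+0042).
           Dumped byte by byte this prints 41 00 42 00 on a little-endian
           machine and 00 41 00 42 on a big-endian one. */
        const unsigned short utf16[] = { 0x0041, 0x0042 };
        const unsigned char *bytes = (const unsigned char *)utf16;

        printf("ASCII bytes:  ");
        for (size_t i = 0; i < sizeof ascii - 1; i++)
            printf("%02X ", (unsigned)(unsigned char)ascii[i]);

        printf("\nUTF-16 bytes: ");
        for (size_t i = 0; i < sizeof utf16; i++)
            printf("%02X ", (unsigned)bytes[i]);
        printf("\n");
        return 0;
    }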

  1. http://www.ibm.com/developerworks/aix/library/au-endianc/
  2. http://teaching.idallen.com/cst8281/10w/notes/110_byte_order_endian.html
  3. ASCII strings and endianness
  4. Is it true that endianness only affects the memory layout of numbers, but not strings? (a post on the relation between the endianness of strings and the machine)
asked Apr 11 '16 by hao.zhou



2 Answers

"is endian of UTF-16 the computer's endianness?"

The impact of your computer's endianness can be looked at from the point of view of a writer or a reader of a file.

If you are reading a file in a -standard- format, then the kind of machine reading it shouldn't matter. The format should be well-defined enough that no matter what the endianness of the reading machine is, the data can still be read correctly.

That doesn't mean the format can't be flexible. With "UTF-16" (when a "BE" or "LE" disambiguation is not used in the format name) the definition allows files to be marked as either big endian or little endian. This is done with something called the "Byte Order Mark" (BOM) in the first two bytes of the file:

https://en.wikipedia.org/wiki/Byte_order_mark
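
(As a sketch of how a reader can use those first two bytes -- detect_utf16_bom is a hypothetical helper name, and input.txt stands in for whatever file you want to check:)

    #include <stdio.h>

    /* Hypothetical helper: sniff the first two bytes of a file for a
       UTF-16 byte order mark (U+FEFF serialized in either byte order). */
    const char *detect_utf16_bom(FILE *f) {
        int b0 = fgetc(f);
        int b1 = fgetc(f);
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE (BOM FE FF)";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE (BOM FF FE)";
        return "no BOM";
    }

    int main(void) {
        FILE *f = fopen("input.txt", "rb");  /* any file you want to check */
        if (!f) return 1;
        printf("%s\n", detect_utf16_bom(f));
        fclose(f);
        return 0;
    }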

The existence of the BOM gives options to the writer of a file. They might choose to write out the most natural endianness for a buffer in memory, and include a BOM that matched. This wouldn't necessarily be the most efficient format for some other reader. But any program claiming UTF-16 support is supposed to be able to handle it either way.

So yes--the computer's endianness might factor into the endianness choice of a BOM-marked UTF-16 file. Still...a little-endian program is fully able to save a file, label it "UTF-16" and have it be big-endian. As long as the BOM is consistent with the data, it doesn't matter what kind of machine writes or reads it.
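
A sketch of that writer strategy in C (the output filename is hypothetical): writing 16-bit code units straight from memory means the BOM automatically comes out in the machine's own byte order.

    #include <stdio.h>

    int main(void) {
        /* BOM first, then the text "Hi". Writing 16-bit code units straight
           from memory serializes them in the machine's native order, so
           U+FEFF comes out as FF FE on little-endian hardware and FE FF on
           big-endian hardware -- a BOM that automatically matches the data. */
        const unsigned short text[] = { 0xFEFF, 0x0048, 0x0069 };
        FILE *f = fopen("hello-utf16.txt", "wb");  /* hypothetical filename */
        if (!f) return 1;
        fwrite(text, sizeof text[0], sizeof text / sizeof text[0], f);
        fclose(f);
        return 0;
    }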

...what if there's no BOM?

This is where things get a little hazy.

On the one hand, RFC 2781 and the Unicode FAQ are clear. They say that a file in "UTF-16" format which starts with neither 0xFF 0xFE nor 0xFE 0xFF is to be interpreted as big endian:

the unmarked form uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used.

Yet to know whether you have a UTF-16LE, UTF-16BE, or BOM-less UTF-16 file...you need metadata outside the file telling you which of the three it is. Because there's not always a place to put that metadata, some programs wound up using heuristics.

Consider something like this from Raymond Chen (2007):

You might decide that programs that generate UTF-16 files without a BOM are broken, but that doesn't mean that they don't exist. For example,

cmd /u /c dir >results.txt

This generates a UTF-16LE file without a BOM.

That's a valid UTF-16LE file, but where would the "UTF-16LE" meta-label be stored? What are the odds someone passes that off by just calling it a UTF-16 file?

Empirically there are warnings about the term. The Wikipedia page for UTF-16 says:

If the BOM is missing, RFC 2781 says that big-endian encoding should be assumed. (In practice, due to Windows using little-endian order by default, many applications similarly assume little-endian encoding by default.)

And unicode.readthedocs.org says:

"UTF-16" and "UTF-32" encoding names are imprecise: depending of the context, format or protocol, it means UTF-16 and UTF-32 with BOM markers, or UTF-16 and UTF-32 in the host endian without BOM. On Windows, "UTF-16" usually means UTF-16-LE.

And further, the Byte-Order-Mark Wikipedia article says:

Clause D98 of conformance (section 3.10) of the Unicode standard states, "The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian."

Whether or not a higher-level protocol is in force is open to interpretation. Files local to a computer for which the native byte ordering is little-endian, for example, might be argued to be encoded as UTF-16LE implicitly. Therefore, the presumption of big-endian is widely ignored.

When those same files are accessible on the Internet, on the other hand, no such presumption can be made. Searching for 16-bit characters in the ASCII range or just the space character (U+0020) is a method of determining the UTF-16 byte order.

So despite the unambiguity of the standard, the context may matter in practice.
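
Here is a sketch of that space-character heuristic in C (guess_utf16_order is a hypothetical name, and real detectors look at more than one character):

    #include <stddef.h>

    /* Count which byte order U+0020 (space) shows up in more often:
       20 00 suggests little-endian text, 00 20 suggests big-endian. */
    const char *guess_utf16_order(const unsigned char *buf, size_t len) {
        size_t le = 0, be = 0;
        for (size_t i = 0; i + 1 < len; i += 2) {
            if (buf[i] == 0x20 && buf[i + 1] == 0x00) le++;
            else if (buf[i] == 0x00 && buf[i + 1] == 0x20) be++;
        }
        if (le > be) return "probably UTF-16LE";
        if (be > le) return "probably UTF-16BE";
        return "undetermined";
    }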

As @rici points out, the standard has been around for a while now. Still, it may pay to double-check files claimed to be "UTF-16". Or even to consider whether you might want to avoid many of these issues and embrace UTF-8...

"Should UTF-16 be considered harmful?"

answered Nov 15 '22 by HostileFork says dont trust SE


The Unicode encoding schemes are defined in section 3.10 of the Unicode standard. The standard defines seven encoding schemes:

  • 8 bit: UTF-8
  • 16 bit: UTF-16BE, UTF-16LE and UTF-16
  • 32 bit: UTF-32BE, UTF-32LE and UTF-32

In the case of the 16- and 32-bit encodings, the three variants differ in endianness, which may be explicit or indicated by starting the string with a Byte Order Mark (BOM) character, U+FEFF:

  • The LE variant is definitely little-endian; the low-order byte is encoded first. No BOM is permitted, so an initial character U+FEFF is a zero-width no-break space.
  • The BE variant is definitely big-endian; the high-order byte is encoded first. As with the LE variant, no BOM is permitted, so an initial character U+FEFF is a zero-width no-break space.
  • The variant without an endian mark may be big- or little-endian. Normally it will start with a BOM which defines the endianness. If there is no BOM, then big-endian encoding is assumed.

If you are going to use 16- or 32-bit encoding schemes for data serialization, it is generally recommended to use the unmarked variants with an explicit BOM. However, UTF-8 is a much more common data interchange format.

Although no endian marker is needed for UTF-8, it is permitted (but not recommended) to start a UTF-8 encoded string with a BOM; this can be used to differentiate between Unicode encoding schemes. Many Windows programs do this, and a U+FEFF at the beginning of a UTF-8 transmission should probably be treated as a BOM (and thus not as Unicode data).
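
For example, U+FEFF encodes in UTF-8 as the bytes EF BB BF, so a tolerant reader can skip them. A sketch (skip_utf8_bom is a hypothetical helper name):

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical helper: if the buffer starts with EF BB BF (U+FEFF in
       UTF-8), advance past those bytes and report that a BOM was present. */
    bool skip_utf8_bom(const unsigned char **p, size_t *len) {
        if (*len >= 3 && (*p)[0] == 0xEF && (*p)[1] == 0xBB && (*p)[2] == 0xBF) {
            *p += 3;
            *len -= 3;
            return true;
        }
        return false;
    }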

answered Nov 15 '22 by rici