Is the default encoding for XML UTF-8 or UTF-16?

OpenTag FAQ states:

If no encoding declaration is present in the XML document (and no external encoding declaration mechanism such as the HTTP header is available), the assumed encoding of an XML document depends on the presence of the Byte-Order-Mark (BOM).

The BOM is a special Unicode marker placed at the top of the file that indicates its encoding. The BOM is optional for UTF-8.

First bytes        Encoding assumed
-----------------------------------------
EF BB BF           UTF-8
FE FF              UTF-16 (big-endian)
FF FE              UTF-16 (little-endian)
00 00 FE FF        UTF-32 (big-endian)
FF FE 00 00        UTF-32 (little-endian)
None of the above  UTF-8

Is there a dumbed-down explanation of the above paragraph?

asked Jun 10 '11 05:06 by Pacerier



1 Answer

You can use an XML declaration like

<?xml version="1.0" encoding="iso-8859-1" ?>

to specify which encoding is used. If the encoding is not specified, a byte order mark (BOM) can be present. If a BOM for UTF-16 or UTF-32 is present, that encoding is used. Otherwise the encoding is UTF-8. (The BOM for UTF-8 is optional.)
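The rules above can be sketched in Python. This is only an illustration, not how any particular XML parser works: `guess_xml_encoding` is a hypothetical helper name, it assumes the declaration (if any) is written in an ASCII-compatible encoding, and it does not check whether a BOM and a declaration contradict each other.

```python
import codecs
import re

# BOM signatures. The UTF-32 entries come first so that the UTF-32 LE
# BOM (FF FE 00 00) is not mistaken for the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Matches encoding="..." inside an XML declaration at the very start
# of the data (assumption: the declaration is ASCII-compatible).
_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def guess_xml_encoding(data: bytes) -> str:
    """Hypothetical helper: guess the encoding of raw XML bytes."""
    # 1. A BOM, if present, identifies the encoding.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. Otherwise look for an explicit encoding declaration.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. No BOM and no declaration: assume UTF-8.
    return "utf-8"
```

For example, `guess_xml_encoding(b'<?xml version="1.0" encoding="iso-8859-1" ?><a/>')` returns `"iso-8859-1"`, while `guess_xml_encoding(b"<a/>")` falls through to `"utf-8"`.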

Edit

The BOM is an invisible character, but there is no need to see it; applications take care of it automatically. When you use Windows Notepad, you can select the encoding when you save the file, and Notepad will automatically insert the BOM at the start of the file. When you later reopen the file, Notepad will recognise the BOM and use the proper encoding to read the file. There is no need for you to ever modify the BOM. If you did, characters could take on a different meaning, so the text would no longer be the same.

I will try to explain with an example. Consider a text file containing just the characters "test". By default, Notepad uses ANSI encoding, so the text file looks like this when you view it in hex mode:

C:\>C:\gnuwin32\bin\hexdump -C test-ansi.txt
00000000  74 65 73 74                                       |test|
00000004

(As you can see, I am using hexdump from gnuwin32, but you can also use a hex editor such as Frhed to see this.)

There is no BOM at the start of this file. There could not be one, because the character used for the BOM does not exist in ANSI encoding. (Because there is no BOM, editors that don't support ANSI encoding would treat this file as UTF-8.)

When I now save the file as UTF-8, you will see 3 extra bytes (the BOM) in front of "test":

C:\>C:\gnuwin32\bin\hexdump -C test-utf8.txt
00000000  ef bb bf 74 65 73 74                              |ï»¿test|
00000007

(If you opened this file with a text editor that does not support UTF-8, you would actually see those bytes as the characters "ï»¿".)

Notepad can also save the file as "Unicode", which means UTF-16 little-endian (UTF-16LE):

C:\>C:\gnuwin32\bin\hexdump -C test-unicode.txt
00000000  ff fe 74 00 65 00 73 00  74 00                    |ÿþt.e.s.t.|
0000000a

And here is the version saved as "Unicode (big endian)", which means UTF-16BE:

C:\>C:\gnuwin32\bin\hexdump -C test-unicode-big-endian.txt
00000000  fe ff 00 74 00 65 00 73  00 74                    |þÿ.t.e.s.t|
0000000a
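If you don't have Notepad and hexdump at hand, the byte sequences in the dumps above can be reproduced with a short Python sketch. It assumes that Notepad's "ANSI" means Windows-1252 (`cp1252`); for plain ASCII text such as "test" any ASCII-compatible code page gives the same bytes. The variable names are only illustrative.

```python
import codecs

text = "test"

# Assumption: "ANSI" is Windows-1252. No BOM is written.
ansi = text.encode("cp1252")

# The "utf-8-sig" codec prepends the UTF-8 BOM (EF BB BF).
utf8_with_bom = text.encode("utf-8-sig")

# The "utf-16-le" / "utf-16-be" codecs write no BOM themselves,
# so the BOM is added explicitly here.
utf16_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
utf16_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

for label, data in [
    ("ANSI", ansi),
    ("UTF-8", utf8_with_bom),
    ("UTF-16LE", utf16_le),
    ("UTF-16BE", utf16_be),
]:
    # bytes.hex(sep) requires Python 3.8 or later.
    print(f"{label:10} {data.hex(' ')}")
```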

Now consider a text file containing the four Chinese characters "琀攀猀琀". When I save it as Unicode (big endian), the result looks like this:

C:\>C:\gnuwin32\bin\hexdump -C test2-unicode-big-endian.txt
00000000  fe ff 74 00 65 00 73 00  74 00                    |þÿt.e.s.t.|
0000000a

As you can see, the word "test" in UTF-16LE is stored with the same bytes as "琀攀猀琀" in UTF-16BE. But because the BOM is stored differently, you can tell whether the file contains "test" or "琀攀猀琀". Without a BOM you would have to guess.
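The same ambiguity can be shown in a few lines of Python, as a small sketch with illustrative variable names. The eight payload bytes are identical in both cases; only the byte order chosen by the reader, or announced by the BOM, decides which text comes out.

```python
# The eight bytes that follow the BOM in both dumps above.
payload = bytes.fromhex("74 00 65 00 73 00 74 00")

# Without a BOM the reader has to pick a byte order.
as_le = payload.decode("utf-16-le")   # "test"
as_be = payload.decode("utf-16-be")   # U+7400 U+6500 U+7300 U+7400

# With a BOM, Python's generic "utf-16" codec reads the byte order
# from the first two bytes and strips the BOM from the result.
with_le_bom = (b"\xff\xfe" + payload).decode("utf-16")
with_be_bom = (b"\xfe\xff" + payload).decode("utf-16")

print(as_le, as_be, with_le_bom, with_be_bom)
```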

answered Oct 06 '22 10:10 by wimh