Is the default encoding for XML UTF-8 or UTF-16?

Tags:

xml

xml-serialization

OpenTag FAQ states:

If no encoding declaration is present in the XML document (and no external encoding declaration mechanism such as the HTTP header is available), the assumed encoding of an XML document depends on the presence of the Byte-Order-Mark (BOM).

The BOM is a special Unicode marker placed at the top of the file that indicates its encoding. The BOM is optional for UTF-8.
First bytes        Encoding assumed
-----------------------------------------
EF BB BF           UTF-8
FE FF              UTF-16 (big-endian)
FF FE              UTF-16 (little-endian)
00 00 FE FF        UTF-32 (big-endian)
FF FE 00 00        UTF-32 (little-endian)
None of the above  UTF-8

Is there a dumbed-down explanation of the above paragraph?

965

asked Jun 10 '11 05:06

1 Answer

You can use a line like

<?xml version="1.0" encoding="iso-8859-1" ?>

to specify which encoding is used. If the encoding is not specified, a byte order mark (BOM) may be present. If a BOM for either UTF-16 or UTF-32 is present, that encoding is used. Otherwise the encoding is UTF-8. (The BOM for UTF-8 is optional.)
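The BOM rule can be sketched in a few lines of Python. This is only an illustration of the table in the question, not how any particular XML parser does it: the function name guess_xml_encoding is made up for this example, and it deliberately ignores the encoding declaration and any external information such as an HTTP header.

```python
import codecs

# Longer signatures come first: the UTF-32LE BOM (FF FE 00 00) starts with
# the UTF-16LE BOM (FF FE), so it has to be tested before it.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def guess_xml_encoding(data: bytes) -> str:
    """Hypothetical helper: guess the encoding from the leading bytes only."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return "utf-8"  # none of the above: assume UTF-8


print(guess_xml_encoding(b"\xff\xfe<\x00a\x00/\x00>\x00"))  # utf-16-le
print(guess_xml_encoding(b"<a/>"))                          # utf-8
```

A real parser would go on to read the XML declaration using the guessed encoding and switch if the declaration names a different one.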

Edit

The BOM is an invisible character, but there is no need to see it: applications take care of it automatically. When you use Windows Notepad, you can select the encoding when you save the file, and Notepad will automatically insert the BOM at the start of the file. When you later reopen the file, Notepad will recognise the BOM and use the proper encoding to read the file. There is never a need to modify the BOM yourself; if you do, the bytes that follow can be interpreted as different characters, so the text will no longer be the same.

I will try to explain with an example. Consider a text file containing just the characters "test". By default, Notepad uses ANSI encoding, and the text file looks like this when you view it in hex mode:

C:\>C:\gnuwin32\bin\hexdump -C test-ansi.txt
00000000  74 65 73 74                                       |test|
00000004

(As you can see, I am using hexdump from gnuwin32, but you can also use a hex editor like Frhed to see this.)

There is no BOM at the start of this file. One would not be possible, because the character used for the BOM does not exist in ANSI encoding. (Because there is no BOM, editors that don't support ANSI encoding would treat this file as UTF-8.)

When I now save the file as UTF-8, you will see 3 extra bytes (the BOM) in front of "test":

C:\>C:\gnuwin32\bin\hexdump -C test-utf8.txt
00000000  ef bb bf 74 65 73 74                              |ï»¿test|
00000007

(If you open this file with a text editor that does not support UTF-8, you will actually see those characters "ï»¿".)

Notepad can also save the file as Unicode, which here means UTF-16 little-endian (UTF-16LE):

C:\>C:\gnuwin32\bin\hexdump -C test-unicode.txt
00000000  ff fe 74 00 65 00 73 00  74 00                    |ÿþt.e.s.t.|
0000000a

And here is the version saved as Unicode big-endian (UTF-16BE):

C:\>C:\gnuwin32\bin\hexdump -C test-unicode-big-endian.txt
00000000  fe ff 00 74 00 65 00 73  00 74                    |þÿ.t.e.s.t|
0000000a
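If you do not have Notepad and hexdump at hand, the byte sequences above can be reproduced with a short Python sketch. It assumes Python 3.8 or later (for the separator argument of bytes.hex) and uses Python's own codec names, which are not what Notepad calls them.

```python
import codecs

text = "test"

# "utf-8-sig" prepends the UTF-8 BOM; plain "utf-8" would not.
utf8_with_bom = text.encode("utf-8-sig")

# "utf-16-le" and "utf-16-be" never write a BOM themselves,
# so it is added by hand here to mimic what Notepad saves.
utf16_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
utf16_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

for label, data in [
    ("UTF-8", utf8_with_bom),
    ("UTF-16LE", utf16_le),
    ("UTF-16BE", utf16_be),
]:
    # Print the bytes as space-separated hex, like the hexdump output.
    print(f"{label:9}{data.hex(' ')}")
```

The three printed lines should match the hex columns of the three hexdump listings above.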

Now consider a text file with the 4 Chinese characters "琀攀猀琀". When I save that as Unicode big-endian (UTF-16BE), the result looks like this:

C:\>C:\gnuwin32\bin\hexdump -C test2-unicode-big-endian.txt
00000000  fe ff 74 00 65 00 73 00  74 00                    |þÿt.e.s.t.|
0000000a

As you can see, the word "test" in UTF-16LE is stored the same way as the word "琀攀猀琀" in UTF-16BE. But because the BOM is stored differently, you can tell whether the file contains "test" or "琀攀猀琀". Without a BOM you would have to guess.
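The ambiguity is easy to demonstrate in Python. This is a minimal sketch; the variable names are only for illustration, and the payload is the 8 bytes that follow the BOM in both hexdumps.

```python
# The 8 bytes that follow the BOM in both files.
payload = bytes.fromhex("7400650073007400")

# Same bytes, two byte orders, two different texts.
as_le = payload.decode("utf-16-le")
as_be = payload.decode("utf-16-be")

print(as_le)                               # test
print([f"U+{ord(c):04X}" for c in as_be])  # code points of the 4 Chinese characters

# With a BOM in front, the generic "utf-16" codec works out the byte order
# by itself and strips the BOM from the result.
with_le_bom = (b"\xff\xfe" + payload).decode("utf-16")
with_be_bom = (b"\xfe\xff" + payload).decode("utf-16")

print(with_le_bom == as_le, with_be_bom == as_be)  # True True
```

The code points are printed instead of the characters themselves so that the example also runs on consoles that cannot display Chinese text.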

118

answered Oct 06 '22 10:10

wimh

Pacerier
