
What is XML BOM and how do I detect it?

What exactly is the BOM in an ANSI XML document, and should it be removed? Should an XML document be in UTF-8 instead? Can anyone tell me a Java method that will detect the BOM? The BOM consists of the bytes EF BB BF.

djangofan asked Nov 20 '09 18:11



2 Answers

For an ANSI XML file it should be removed. If you want to use UTF-8 you don't really need it. It is only needed for UTF-16 and UTF-32.

The Byte-Order-Mark (or BOM), is a special marker added at the very beginning of an Unicode file encoded in UTF-8, UTF-16 or UTF-32. It is used to indicate whether the file uses the big-endian or little-endian byte order. The BOM is mandatory for UTF-16 and UTF-32, but it is optional for UTF-8.

(Source: https://www.opentag.com/xfaq_enc.htm#enc_bom)

Regarding the question of how to detect this in Java:

Check the following answer to this question: Java : How to determine the correct charset encoding of a stream

Basically just read in the first few bytes yourself and then determine if you may have found a BOM.
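As a rough sketch of that idea for the UTF-8 case from the question: the class and method names below (`Utf8BomSkipper`, `hasUtf8Bom`, `skipUtf8Bom`) are made up for illustration, and the code assumes Java 9 or later for `readNBytes`. It peeks at the first three bytes and pushes them back if they are not EF BB BF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

class Utf8BomSkipper {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** True if the given bytes start with the UTF-8 BOM (EF BB BF). */
    static boolean hasUtf8Bom(byte[] head) {
        if (head.length < UTF8_BOM.length) {
            return false;
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (head[i] != UTF8_BOM[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns a stream positioned just after a leading UTF-8 BOM, if there is one. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int read = pushback.readNBytes(head, 0, head.length); // Java 9+
        boolean bomFound = read == UTF8_BOM.length && hasUtf8Bom(head);
        if (!bomFound && read > 0) {
            // Not a BOM: put the bytes back so the caller sees the full content.
            pushback.unread(head, 0, read);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(withBom));
        System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8)); // <a/>
    }
}
```

A stream without a BOM passes through unchanged, so the wrapper can be applied unconditionally before handing the data to other code.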

jitter answered Oct 20 '22 23:10


The byte order mark is likely to be one of these byte sequences:

    UTF-8 BOM: ef bb bf
    UTF-16BE BOM: fe ff
    UTF-16LE BOM: ff fe
    UTF-32BE BOM: 00 00 fe ff
    UTF-32LE BOM: ff fe 00 00
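One way to check a byte array against all five sequences is sketched below. `BomDetector` and its `Bom` enum are hypothetical names, and the range form of `Arrays.equals` assumes Java 9 or later. The four-byte signatures are tried first because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.

```java
import java.util.Arrays;

class BomDetector {
    enum Bom {
        // Longest signatures first: ff fe 00 00 would otherwise match ff fe.
        UTF_32BE("UTF-32BE", 0x00, 0x00, 0xFE, 0xFF),
        UTF_32LE("UTF-32LE", 0xFF, 0xFE, 0x00, 0x00),
        UTF_8("UTF-8", 0xEF, 0xBB, 0xBF),
        UTF_16BE("UTF-16BE", 0xFE, 0xFF),
        UTF_16LE("UTF-16LE", 0xFF, 0xFE);

        final String charsetName;
        final byte[] signature;

        Bom(String charsetName, int... bytes) {
            this.charsetName = charsetName;
            this.signature = new byte[bytes.length];
            for (int i = 0; i < bytes.length; i++) {
                this.signature[i] = (byte) bytes[i];
            }
        }
    }

    /** Returns the BOM that prefixes data, or null if none of the five matches. */
    static Bom detect(byte[] data) {
        for (Bom bom : Bom.values()) {
            int n = bom.signature.length;
            if (data.length >= n && Arrays.equals(data, 0, n, bom.signature, 0, n)) {
                return bom;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, '<', 0x00};
        System.out.println(detect(utf16le)); // UTF_16LE
    }
}
```

The check is a heuristic: UTF-16LE text whose first character after the BOM is U+0000 has the same leading bytes as a UTF-32LE BOM, so this sketch would report it as UTF-32LE.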

These are the variously encoded forms of the Unicode code point U+FEFF. This can be expressed as a Java char literal using '\uFEFF' (Java char values are implicitly UTF-16). Most non-Unicode encodings have no mapping for U+FEFF, so they cannot encode a BOM at all. (More on encoding the BOM using Java here.)
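To see how the same code point turns into those different byte sequences, here is a small sketch that encodes U+FEFF with several charsets. `BomEncodings` and its helpers are illustrative names only, and the UTF-32 charsets are looked up by name on the assumption that the running JVM provides them (they are not part of `StandardCharsets`).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomEncodings {
    /** Formats bytes as lowercase hex pairs separated by spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes the single code point U+FEFF with the given charset. */
    static String encodedBom(Charset charset) {
        return hex("\uFEFF".getBytes(charset));
    }

    /** True if the charset has a mapping for U+FEFF. */
    static boolean canEncodeBom(Charset charset) {
        return charset.newEncoder().canEncode('\uFEFF');
    }

    public static void main(String[] args) {
        System.out.println(encodedBom(StandardCharsets.UTF_8));    // ef bb bf
        System.out.println(encodedBom(StandardCharsets.UTF_16BE)); // fe ff
        System.out.println(encodedBom(StandardCharsets.UTF_16LE)); // ff fe
        // Assumption: UTF-32 charsets are optional, so check before using them.
        if (Charset.isSupported("UTF-32BE")) {
            System.out.println(encodedBom(Charset.forName("UTF-32BE")));
        }
        System.out.println(canEncodeBom(StandardCharsets.ISO_8859_1)); // false
    }
}
```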

When it comes to BOMs and XML, they are optional (see also the Unicode BOM FAQ). Detection of encoding in XML is relatively straightforward if the encoding is specified in the declaration. Always make sure that the XML declaration (<?xml version="1.0" encoding="UTF-8"?>) matches the encoding used to write the document. If you are strict about this, parsers should be able to interpret your documents correctly. (XML spec on encoding detection.)
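A minimal sketch of keeping the declaration and the byte encoding in step, using the JDK's built-in DOM parser: `DeclaredEncodingDemo` and its methods are hypothetical, and `write` assumes the text contains no markup characters such as `<` or `&`. The charset used to encode the bytes is the same one named in the declaration, so the parser can decode the document without any outside hint.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DeclaredEncodingDemo {
    /** Builds a tiny document whose declaration names the charset used to encode it. */
    static byte[] write(String text, Charset charset) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>"
                + "<root>" + text + "</root>";
        return xml.getBytes(charset);
    }

    /** Parses the bytes and returns the text of the root element. */
    static String readRootText(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String text = "caf\u00e9";
        // The same text survives in either encoding because declaration and bytes agree.
        System.out.println(text.equals(readRootText(write(text, StandardCharsets.UTF_8))));      // true
        System.out.println(text.equals(readRootText(write(text, StandardCharsets.ISO_8859_1)))); // true
    }
}
```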

I advocate encoding as Unicode wherever possible (see also the 10 Commandments of Unicode). That said, XML allows the representation of any Unicode character via numeric character references (e.g. 'A' could be represented by &#x0041;), so a Unicode encoding isn't strictly required to avoid data loss.
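As an illustration of that escape mechanism, the sketch below rewrites every non-ASCII code point as a hexadecimal character reference so the result can be stored in a plain ASCII document. `NumericCharRefs.escapeNonAscii` is a made-up helper, and it deliberately leaves markup characters such as `&` and `<` alone, so it is not a complete XML escaper.

```java
import java.util.Locale;

class NumericCharRefs {
    /** Replaces each code point above U+007F with a reference of the form &#xHHHH; */
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                sb.appendCodePoint(cp);
            } else {
                sb.append("&#x")
                  .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                  .append(';');
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("caf\u00e9")); // caf&#xE9;
    }
}
```

Iterating over code points rather than chars keeps characters outside the Basic Multilingual Plane as a single reference instead of two surrogate halves.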

McDowell answered Oct 20 '22 23:10