I'm reading a file through a FileReader - the file is UTF-8 decoded (with BOM) now my problem is: I read the file and output a string, but sadly the BOM marker is outputted too. Why this occurs?
fr = new FileReader(file); br = new BufferedReader(fr); String tmp = null; while ((tmp = br.readLine()) != null) { String text; text = new String(tmp.getBytes(), "UTF-8"); content += text + System.getProperty("line.separator"); }
output after first line
?<style>
The UTF-8 file signature (commonly also called a "BOM") identifies the encoding format rather than the byte order of the document. UTF-8 is a linear sequence of bytes and not sequence of 2-byte or 4-byte units where the byte order is important. Encoding. Encoded BOM. UTF-8.
To check if BOM character exists, open the file in Notepad++ and look at the bottom right corner. If it says UTF-8-BOM then the file contains BOM character.
There is no official difference between UTF-8 and BOM-ed UTF-8. A BOM-ed UTF-8 string will start with the three following bytes. EF BB BF. Those bytes, if present, must be ignored when extracting the string from the file/stream.
To Add BOM to a UTF-8 file, we can directly write Unicode \ufeff or three bytes 0xEF , 0xBB , 0xBF at the beginning of the UTF-8 file. The Unicode \ufeff represents 0xEF , 0xBB , 0xBF , read this.
In Java, you have to consume manually the UTF8 BOM if present. This behaviour is documented in the Java bug database, here and here. There will be no fix for now because it will break existing tools like JavaDoc or XML parsers. The Apache IO Commons provides a BOMInputStream
to handle this situation.
Take a look at this solution: Handle UTF8 file with BOM
The easiest fix is probably just to remove the resulting \uFEFF
from the string, since it is extremely unlikely to appear for any other reason.
tmp = tmp.replace("\uFEFF", "");
Also see this Guava bug report
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With