
Reading UTF-8 - BOM marker

I'm reading a file through a FileReader. The file is UTF-8 encoded (with a BOM). My problem is that when I read the file and output a string, the BOM marker is output too. Why does this occur?

fr = new FileReader(file);
br = new BufferedReader(fr);
String tmp = null;
while ((tmp = br.readLine()) != null) {
    String text;
    text = new String(tmp.getBytes(), "UTF-8");
    content += text + System.getProperty("line.separator");
}

output after first line

?<style> 
onigunn asked Feb 04 '11 12:02




2 Answers

In Java, you have to manually consume the UTF-8 BOM if it is present. This behaviour is documented in two entries in the Java bug database. There will be no fix for now because it would break existing tools such as JavaDoc or XML parsers. Apache Commons IO provides a BOMInputStream to handle this situation.

Take a look at this solution: Handle UTF8 file with BOM
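As a rough illustration of consuming the BOM by hand with only the JDK, here is a minimal sketch. The class name `BomSkippingReader` and the helper `openUtf8` are hypothetical and are not taken from the linked solution. It assumes the input really is UTF-8. It peeks at the first three bytes, drops them if they are EF BB BF, and otherwise pushes them back.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

class BomSkippingReader {

    // Hypothetical helper: returns a UTF-8 reader positioned after a leading EF BB BF, if any.
    static BufferedReader openUtf8(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];

        // Read up to three bytes; a single read() call may return fewer.
        int n = 0;
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break; // the stream is shorter than a BOM
            }
            n += r;
        }

        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // No BOM: push the bytes back so the reader still sees them.
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return new BufferedReader(new InputStreamReader(pushback, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // EF BB BF followed by the ASCII bytes of "<style>"
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                       0x3C, 0x73, 0x74, 0x79, 0x6C, 0x65, 0x3E};
        try (BufferedReader br = openUtf8(new ByteArrayInputStream(data))) {
            System.out.println(br.readLine()); // prints "<style>"
        }
    }
}
```

For a real file, the stream passed in would be something like `new FileInputStream(file)` rather than a `FileReader`, because `FileReader(File)` decodes with the platform default charset and the raw bytes are no longer available to inspect.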

RealHowTo answered Sep 21 '22 23:09


The easiest fix is probably just to remove the resulting \uFEFF from the string, since it is extremely unlikely to appear for any other reason.

tmp = tmp.replace("\uFEFF", ""); 

Also see this Guava bug report
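A slightly narrower variant of the same idea, sketched under the assumption that the text has already been decoded as UTF-8: strip the character only when it is the very first one, so a U+FEFF that legitimately appears later in the text is left alone. The class and method names here (`LeadingBomStripper`, `stripLeadingBom`) are made up for illustration.

```java
class LeadingBomStripper {

    // Hypothetical helper: removes a single leading U+FEFF, leaves everything else untouched.
    static String stripLeadingBom(String s) {
        if (s != null && !s.isEmpty() && s.charAt(0) == '\uFEFF') {
            return s.substring(1);
        }
        return s;
    }

    public static void main(String[] args) {
        String firstLine = "\uFEFF<style>";
        System.out.println(stripLeadingBom(firstLine)); // prints "<style>"
    }
}
```

In a read loop this would be applied to the first line only, which also avoids scanning every line for a character that can only be a BOM at the start of the file.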

finnw answered Sep 21 '22 23:09