
How to avoid tripping over UTF-8 BOM when reading files

I'm consuming a data feed that has recently added a Unicode BOM header (U+FEFF), and my rake task is now messed up by it.

I can skip the first 3 bytes with file.gets[3..-1], but is there a more elegant way to read files in Ruby that handles this correctly whether or not a BOM is present?

Andrew Vit asked Feb 12 '09 20:02



2 Answers

With Ruby 1.9.2 or later, you can use the mode r:bom|utf-8:

# Define the variable outside the block so the data is kept after the block ends.
text_without_bom = nil
File.open('file.txt', "r:bom|utf-8") do |file|
  text_without_bom = file.read
end

or

text_without_bom = File.read('file.txt', encoding: 'bom|utf-8') 

or

text_without_bom = File.read('file.txt', mode: 'r:bom|utf-8') 

It doesn't matter whether the file contains a BOM or not.
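To check that claim locally, here is a minimal sketch that writes the same text to two temporary files, one with a BOM and one without, and reads both back with the BOM-aware mode. The helper name read_with_bom_mode is hypothetical and exists only for this demonstration.

```ruby
require 'tempfile'

# Hypothetical helper for this demonstration: writes raw bytes to a temporary
# file, then reads the file back using the BOM-aware mode from the answer.
def read_with_bom_mode(bytes)
  Tempfile.create('bom_demo') do |f|
    f.binmode
    f.write(bytes)
    f.flush
    File.read(f.path, mode: 'r:bom|utf-8')
  end
end

with_bom    = read_with_bom_mode("\xEF\xBB\xBFhello".b)  # EF BB BF is the UTF-8 BOM
without_bom = read_with_bom_mode("hello".b)

p with_bom == without_bom  # the BOM should be gone, so the two strings match
p with_bom.encoding
```

Both reads should yield the same string tagged as UTF-8, which is what lets the same code path serve feeds with and without the BOM.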


You may also use the encoding option with other commands:

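text_without_bom = File.readlines(@filename, encoding: 'bom|utf-8')

(This returns an array of all lines. Pass the encoding as a keyword option; a plain string in the second position is treated as the line separator.)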

Or with CSV:

require 'csv'
CSV.open(@filename, 'r:bom|utf-8') do |csv|
  csv.each { |row| p row }
end
knut answered Sep 21 '22 13:09


I wouldn't blindly skip the first three bytes; what if the producer stops adding the BOM again? What you should do is examine the first few bytes, and if they're 0xEF 0xBB 0xBF, ignore them. That's the form the BOM character (U+FEFF) takes in UTF-8; I prefer to deal with it before trying to decode the stream because BOM handling is so inconsistent from one language/tool/framework to the next.
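A minimal sketch of that byte-level check in Ruby, assuming the whole file fits in memory. The names UTF8_BOM, strip_utf8_bom and read_utf8_skipping_bom are hypothetical; this is one way to do it, not a standard API.

```ruby
# The UTF-8 encoding of U+FEFF, kept as a binary string (hypothetical constant name).
UTF8_BOM = "\xEF\xBB\xBF".b.freeze

# Hypothetical helper: drops a leading UTF-8 BOM from raw bytes if one is
# present, then labels the result as UTF-8. Input without a BOM passes through.
def strip_utf8_bom(bytes)
  bytes = bytes.b  # work on a binary copy so the comparison is byte-wise
  bytes = bytes.byteslice(UTF8_BOM.bytesize..-1) if bytes.start_with?(UTF8_BOM)
  bytes.force_encoding(Encoding::UTF_8)
end

# Hypothetical helper: reads a file as raw bytes and applies the check above.
def read_utf8_skipping_bom(path)
  strip_utf8_bom(File.binread(path))
end
```

Because the check runs on raw bytes before any decoding, it behaves the same whether or not the producer keeps sending the BOM.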

In fact, that's how you're supposed to deal with a BOM. If a file has been served as UTF-16, you have to examine the first two bytes before you start decoding so you know whether to read it as big-endian or little-endian. Of course, the UTF-8 BOM has nothing to do with byte order; it's just there to let you know that the encoding is UTF-8, in case you didn't already know that.
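For the UTF-16 case, the two-byte check could look roughly like the sketch below. The helper names are hypothetical, and the sketch assumes the data really is UTF-16; it does not try to tell a little-endian UTF-16 BOM (FF FE) apart from a UTF-32LE BOM, which starts with the same two bytes.

```ruby
# Hypothetical helper: looks at the first two bytes of raw data and returns the
# matching UTF-16 encoding, or nil when no UTF-16 BOM is present.
def utf16_encoding_from_bom(bytes)
  case bytes.byteslice(0, 2).to_s.b
  when "\xFE\xFF".b then Encoding::UTF_16BE  # big-endian BOM
  when "\xFF\xFE".b then Encoding::UTF_16LE  # little-endian BOM
  end
end

# Hypothetical helper: decodes BOM-prefixed UTF-16 bytes to a UTF-8 string,
# or returns nil when the BOM is missing and the byte order is unknown.
def decode_utf16(bytes)
  enc = utf16_encoding_from_bom(bytes)
  return nil unless enc

  bytes.byteslice(2..-1).force_encoding(enc).encode(Encoding::UTF_8)
end
```

Here the BOM does carry information: the same two payload bytes decode to different characters depending on which order the BOM announces.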

Alan Moore answered Sep 21 '22 13:09