I'm consuming a data feed that has recently added a Unicode BOM header (U+FEFF), and my rake task is now messed up by it.
I can skip the first 3 bytes with file.gets[3..-1]
but is there a more elegant way to read files in Ruby which can handle this correctly, whether a BOM is present or not?
There is no official difference between UTF-8 and BOM-ed UTF-8. A BOM-ed UTF-8 string starts with the following three bytes: EF BB BF. Those bytes, if present, must be ignored when extracting the string from the file/stream.
UTF-8 is a variable-width encoding built on 8-bit code units. The first 128 Unicode characters, when encoded as UTF-8, have the same single-byte representation as the corresponding ASCII characters.
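Both claims are easy to check in irb. This is only a small sketch; the variable names are mine:

```ruby
# Encode the BOM character U+FEFF as UTF-8 and show its bytes in hex.
bom_bytes = "\uFEFF".encode("UTF-8").bytes
bom_hex = bom_bytes.map { |b| format("%02X", b) }.join(" ")
puts bom_hex # prints "EF BB BF"

# ASCII characters keep the same single-byte values in UTF-8.
ascii_bytes = "Ruby".encode("US-ASCII").bytes
utf8_bytes  = "Ruby".encode("UTF-8").bytes
puts ascii_bytes == utf8_bytes # prints "true"
```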
With Ruby 1.9.2 or later you can use the mode r:bom|utf-8
text_without_bom = nil # define the variable outside the block to keep the data
File.open('file.txt', "r:bom|utf-8") { |file| text_without_bom = file.read }
or
text_without_bom = File.read('file.txt', encoding: 'bom|utf-8')
or
text_without_bom = File.read('file.txt', mode: 'r:bom|utf-8')
It doesn't matter whether the file actually contains a BOM.
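If you want to convince yourself, here is a rough sketch that writes the same content once with and once without a BOM and reads both back. The helper name read_without_bom and the sample content are made up for the demonstration:

```ruby
require 'tempfile'

# Hypothetical helper: writes raw bytes to a temp file and reads them back
# using the BOM-aware mode.
def read_without_bom(raw_bytes)
  Tempfile.create('bom_demo') do |tmp|
    tmp.binmode
    tmp.write(raw_bytes)
    tmp.flush
    File.read(tmp.path, mode: 'r:bom|utf-8')
  end
end

with_bom    = read_without_bom("\xEF\xBB\xBFhello\n".b) # file starts with a BOM
without_bom = read_without_bom("hello\n".b)             # file has no BOM

p with_bom == without_bom # both reads should give the same text
```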
You may also use the encoding option with other commands:
text_without_bom = File.readlines(@filename, encoding: 'bom|utf-8')
(You get an array with all lines).
Or with CSV:
require 'csv'
CSV.open(@filename, 'r:bom|utf-8') { |csv| csv.each { |row| p row } }
I wouldn't blindly skip the first three bytes; what if the producer stops adding the BOM again? What you should do is examine the first few bytes, and if they're 0xEF 0xBB 0xBF, ignore them. That's the form the BOM character (U+FEFF) takes in UTF-8; I prefer to deal with it before trying to decode the stream because BOM handling is so inconsistent from one language/tool/framework to the next.
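A minimal sketch of that check in Ruby, assuming the data arrives as a binary IO. The helper names skip_utf8_bom and read_utf8 are my own, not part of any library:

```ruby
require 'stringio'

UTF8_BOM = "\xEF\xBB\xBF".b # raw bytes of U+FEFF in UTF-8

# Hypothetical helper: consumes a leading UTF-8 BOM from a binary IO if one
# is present, otherwise seeks back so that no data is lost.
def skip_utf8_bom(io)
  start = io.pos
  head = io.read(3) # nil at EOF, shorter than 3 bytes for tiny inputs
  io.seek(start) unless head == UTF8_BOM
  io
end

# Hypothetical helper: reads the rest of the stream and tags it as UTF-8.
def read_utf8(io)
  skip_utf8_bom(io).read.force_encoding(Encoding::UTF_8)
end

with_bom    = read_utf8(StringIO.new("\xEF\xBB\xBFid,name\n".b))
without_bom = read_utf8(StringIO.new("id,name\n".b))
p with_bom == without_bom
```

For a real file the same helper should work with File.open(path, 'rb') { |f| read_utf8(f) }, since it only relies on pos, seek and read.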
In fact, that's how you're supposed to deal with a BOM. If a file has been served as UTF-16, you have to examine the first two bytes before you start decoding so you know whether to read it as big-endian or little-endian. Of course, the UTF-8 BOM has nothing to do with byte order; it's just there to let you know that the encoding is UTF-8, in case you didn't already know that.
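To illustrate the UTF-16 case, here is a hedged sketch that picks the encoding from the leading bytes before decoding. encoding_from_bom is a hypothetical helper, and it deliberately ignores UTF-32, whose little-endian BOM (FF FE 00 00) begins with the same two bytes as the UTF-16LE one:

```ruby
# Hypothetical helper: inspects the leading bytes of raw data and returns
# the encoding implied by a BOM, or nil when no BOM is recognized.
def encoding_from_bom(bytes)
  head = bytes.b # compare as raw bytes
  if head.start_with?("\xEF\xBB\xBF".b)
    Encoding::UTF_8
  elsif head.start_with?("\xFE\xFF".b)
    Encoding::UTF_16BE
  elsif head.start_with?("\xFF\xFE".b)
    Encoding::UTF_16LE
  end
end

raw  = "\xFF\xFEh\x00i\x00".b  # "hi" in little-endian UTF-16, BOM included
enc  = encoding_from_bom(raw)
# Drop the two BOM bytes, tag the rest with the detected encoding, transcode.
text = raw.byteslice(2..-1).force_encoding(enc).encode("UTF-8")
puts text # prints "hi"
```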