I'm consuming a data feed that has recently added a Unicode BOM header (U+FEFF), and my rake task is now messed up by it. I can skip the first 3 bytes with file.gets[3..-1], but is there a more elegant way to read files in Ruby that handles this correctly, whether or not a BOM is present?



How to avoid tripping over UTF-8 BOM when reading files

2 Answers

With Ruby 1.9.2 or later you can use the mode r:bom|utf-8:

text_without_bom = nil # define the variable outside the block to keep the data
File.open('file.txt', 'r:bom|utf-8') do |file|
  text_without_bom = file.read
end

or

text_without_bom = File.read('file.txt', encoding: 'bom|utf-8')

or

text_without_bom = File.read('file.txt', mode: 'r:bom|utf-8')

It doesn't matter whether the BOM is present in the file or not.
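A quick way to check that claim is to write the same text twice, once with a leading BOM and once without, and read both back with the BOM-aware mode. The following is only a minimal sketch: the helper name read_without_bom and the use of a temporary file are assumptions made for this illustration, not part of the answer above.

```ruby
require 'tempfile'

# Hypothetical helper for this demo: write raw bytes to a temporary file,
# then read them back using the BOM-aware mode from the answer.
def read_without_bom(bytes)
  Tempfile.create('bom_demo') do |tmp|
    tmp.binmode
    tmp.write(bytes)
    tmp.flush
    File.read(tmp.path, mode: 'r:bom|utf-8')
  end
end

with_bom    = read_without_bom("\xEF\xBB\xBFhello\n".b) # BOM bytes + text
without_bom = read_without_bom("hello\n".b)             # text only

puts with_bom == without_bom # both reads should yield the same string
puts with_bom.encoding
```

Both strings should come back identical and tagged as UTF-8, so the calling code does not need to know whether the producer sent a BOM.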

You may also use the encoding option with other commands:

lines_without_bom = File.readlines(@filename, encoding: 'bom|utf-8')

(You get an array with all lines).

Or with CSV:

require 'csv'
CSV.open(@filename, 'r:bom|utf-8') do |csv|
  csv.each { |row| p row }
end


answered Sep 21 '22 13:09

knut

I wouldn't blindly skip the first three bytes; what if the producer stops adding the BOM again? What you should do is examine the first few bytes, and if they're 0xEF 0xBB 0xBF, ignore them. That's the form the BOM character (U+FEFF) takes in UTF-8; I prefer to deal with it before trying to decode the stream because BOM handling is so inconsistent from one language/tool/framework to the next.
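The byte-level check described here could look roughly like the sketch below. It works on any IO opened in binary mode; the helper names skip_utf8_bom and read_utf8 are hypothetical, and StringIO stands in for a real file so the example is self-contained.

```ruby
require 'stringio'

UTF8_BOM = "\xEF\xBB\xBF".b # the three bytes U+FEFF takes in UTF-8

# Hypothetical helper: consume a leading UTF-8 BOM if one is present,
# otherwise rewind to where reading started.
def skip_utf8_bom(io)
  start = io.pos
  head = io.read(3) # nil at EOF, shorter than 3 bytes for tiny inputs
  io.seek(start) unless head == UTF8_BOM
  io
end

# Hypothetical helper: read the remaining bytes and label them as UTF-8.
def read_utf8(io)
  skip_utf8_bom(io).read.force_encoding(Encoding::UTF_8)
end

with_bom    = StringIO.new("\xEF\xBB\xBFid,name\n".b)
without_bom = StringIO.new("id,name\n".b)

p read_utf8(with_bom)
p read_utf8(without_bom)
```

With a real file the same helper could be used as File.open(path, 'rb') { |f| read_utf8(f) }. Because the check happens on raw bytes before any decoding, it behaves the same whether or not the producer keeps sending the BOM.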

In fact, that's how you're supposed to deal with a BOM. If a file has been served as UTF-16, you have to examine the first two bytes before you start decoding so you know whether to read it as big-endian or little-endian. Of course, the UTF-8 BOM has nothing to do with byte order; it's just there to let you know that the encoding is UTF-8, in case you didn't already know that.
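For the UTF-16 case, inspecting the first two bytes might be sketched as follows. The helper name decode_utf16 is hypothetical, and the sketch assumes the input is already known to be UTF-16, since the bytes FF FE also begin a UTF-32 little-endian BOM.

```ruby
# Hypothetical helper: pick the byte order from the leading BOM,
# drop the BOM, and transcode the rest to UTF-8.
def decode_utf16(bytes)
  bytes = bytes.b # work on raw bytes
  encoding =
    case bytes.byteslice(0, 2)
    when "\xFE\xFF".b then Encoding::UTF_16BE # big-endian BOM
    when "\xFF\xFE".b then Encoding::UTF_16LE # little-endian BOM
    else raise ArgumentError, 'no UTF-16 BOM found'
    end
  payload = bytes.byteslice(2, bytes.bytesize - 2)
  payload.force_encoding(encoding).encode(Encoding::UTF_8)
end

big_endian    = "\xFE\xFF\x00h\x00i".b
little_endian = "\xFF\xFEh\x00i\x00".b

p decode_utf16(big_endian)
p decode_utf16(little_endian)
```

Both inputs carry the same two characters, and only the BOM tells the decoder which byte of each pair comes first.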

answered Sep 21 '22 13:09

Alan Moore


Tags: file, ruby, unicode, byte-order-mark

Andrew Vit
