
Character Encoding issue in Rails v3/Ruby 1.9.2

I get this error sometimes "invalid byte sequence in UTF-8" when I read contents from a file. Note - this only happens when there are some special characters in the string. I have tried opening the file without "r:UTF-8", but still get the same error.

open(file, "r:UTF-8").each_line { |line| puts line.split(",") } # line.split generates the error

Contents of the file:

# encoding: UTF-8
290919,"SE","26","Sk‰l","",59.4500,17.9500,, # this errors out
290956,"CZ","45","HornÌ Bradlo","",49.8000,15.7500,, # this errors out
290958,"NO","02","Svaland","",58.4000,8.0500,, # this works

This is a CSV file I got from an outside source, and I am trying to import it into my DB. It did not come with "# encoding: UTF-8" at the top; I added that line because I read somewhere that it would fix this problem, but it did not. :(

Environment:

  • Rails v3.0.3
  • ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.5.0]
kapso asked Jan 15 '11 00:01


2 Answers

Ruby has a notion of an external encoding and an internal encoding for each file. This lets you work with a file's contents as UTF-8 in your code, even when the file is stored in a more esoteric encoding. If your default external encoding is UTF-8 (which it typically is on Mac OS X), all of your file I/O is going to be in UTF-8 as well. You can check this using File.open('file').external_encoding. When you open your file and pass "r:UTF-8", you are forcing the same external encoding that Ruby is already using by default.
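To see the two encodings for yourself, here is a minimal sketch. The helper name encodings_for and the temp file are only there for illustration, and the first value printed depends on your locale, so treat the output as something to inspect rather than a fixed result.

```ruby
require 'tempfile'

# Hypothetical helper for this sketch: returns the external and internal
# encodings Ruby attaches to a file handle opened with the given mode string.
def encodings_for(path, mode)
  File.open(path, mode) do |f|
    [f.external_encoding, f.internal_encoding]
  end
end

Tempfile.create('encoding-demo') do |tmp|
  tmp.close

  # The process-wide default; it depends on your locale settings.
  p Encoding.default_external

  # No encoding given: the external encoding falls back to the default.
  p encodings_for(tmp.path, 'r')

  # External encoding only: bytes are tagged as UTF-8, nothing is converted.
  p encodings_for(tmp.path, 'r:UTF-8')

  # External and internal: bytes are read as ISO-8859-1 and transcoded to UTF-8.
  p encodings_for(tmp.path, 'r:ISO-8859-1:UTF-8')
end
```

An internal encoding of nil means Ruby hands you the bytes tagged with the external encoding and does no transcoding.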

Chances are, your source document isn't in UTF-8 and those non-ASCII characters aren't mapping cleanly to UTF-8 (if they were, you would get the correct characters and no error; if they mapped to valid but different characters, you would get incorrect characters and no error). What you should do is determine the encoding of the source document, then have Ruby transcode the document on read, like so:

# Replace windows-1251 with whatever encoding the file actually uses.
File.open(file, "r:windows-1251:utf-8").each_line { |line| puts line.split(",") }
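Here is a small self-contained sketch of the difference between forcing UTF-8 and transcoding on read. It assumes, purely for illustration, that the file is ISO-8859-1 and that the first city name is "Skäl"; your file's real encoding may be different. LATIN1_ROW and read_lines are names made up for this example.

```ruby
require 'tempfile'

# Assumed sample data: "Skäl" stored as ISO-8859-1, where "ä" is the single
# byte 0xE4. That byte on its own is not a valid UTF-8 sequence.
LATIN1_ROW = "290919,\"SE\",\"26\",\"Sk\xE4l\"\n".b

# Hypothetical helper: read every line of the file using the given mode string.
def read_lines(path, mode)
  File.open(path, mode) { |f| f.readlines }
end

Tempfile.create('cities') do |tmp|
  tmp.binmode
  tmp.write(LATIN1_ROW)
  tmp.close

  # Tagging the bytes as UTF-8 does not convert them, so the string is invalid.
  as_utf8 = read_lines(tmp.path, 'r:UTF-8').first
  p as_utf8.valid_encoding?      # => false

  # Declaring the real external encoding lets Ruby transcode to UTF-8 on read.
  transcoded = read_lines(tmp.path, 'r:ISO-8859-1:UTF-8').first
  p transcoded.valid_encoding?   # => true
  p transcoded.chomp.split(',')
end
```

Checking valid_encoding? on a line before processing it is a cheap way to confirm whether the encoding you guessed is at least plausible.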

If you need help determining the encoding of the source, give this Python library a whirl. It's based on the automatic charset detection fallback that was in Seamonkey/Mozilla (and is possibly still in Firefox).

coreyward answered Nov 02 '22 20:11


If you want to change your file's encoding, you can use the charlock_holmes gem:

https://github.com/brianmario/charlock_holmes

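require 'charlock_holmes/string'

content = File.read('test2.txt')
# Default to the original content so the variable is set either way.
utf8_encoded_content = content
if !content.is_utf8?
  # Guess the source encoding, then convert the content to UTF-8.
  detection = CharlockHolmes::EncodingDetector.detect(content)
  utf8_encoded_content = CharlockHolmes::Converter.convert content, detection[:encoding], 'UTF-8'
end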

Then you can save the new content to a temp file and overwrite your original file.
Hope this helps.
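For the "save to a temp file and overwrite" step, something like the following might work. This is only a sketch: it uses Ruby's built-in String#encode instead of charlock_holmes so that it runs without the gem, and the source encoding is an assumption you pass in (for example, whatever a detector reported). The method name rewrite_as_utf8 and the ".utf8.tmp" suffix are made up for this example.

```ruby
require 'tmpdir'

# Hypothetical helper: rewrite the file at `path` as UTF-8, assuming its
# current bytes are in `source_encoding`.
def rewrite_as_utf8(path, source_encoding)
  raw = File.binread(path)
  converted = raw.force_encoding(source_encoding).encode(Encoding::UTF_8)

  # Write next to the original so the rename stays on the same filesystem.
  tmp_path = "#{path}.utf8.tmp"
  File.binwrite(tmp_path, converted)

  # Replace the original only after the converted copy is fully written.
  File.rename(tmp_path, path)
  converted
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'test2.txt')
  File.binwrite(path, "Sk\xE4l\n".b) # assumed ISO-8859-1 sample bytes
  rewrite_as_utf8(path, 'ISO-8859-1')
  p File.read(path, encoding: 'UTF-8').valid_encoding? # => true
end
```

Writing the converted copy first means a failed conversion leaves the original untouched. Note that the replacement file gets default permissions, so restore the original file mode afterwards if that matters to you.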

Olivier Grimard answered Nov 02 '22 19:11