I get this error sometimes "invalid byte sequence in UTF-8" when I read contents from a file. Note - this only happens when there are some special characters in the string. I have tried opening the file without "r:UTF-8", but still get the same error.
open(file, "r:UTF-8").each_line { |line| puts line.strip(",") } # line.strip generates the error
Contents of the file:
# encoding: UTF-8
290919,"SE","26","Sk‰l","",59.4500,17.9500,, # this errors out
290956,"CZ","45","HornÌ Bradlo","",49.8000,15.7500,, # this errors out
290958,"NO","02","Svaland","",58.4000,8.0500,, # this works
This is the CSV file I got from outside and I am trying to import it into my DB, it did not come with "# encoding: UTF-8" at the top, but I added this since I read somewhere it will fix this problem, but it did not. :(
Environment:
Ruby has a notion of an external encoding and internal encoding for each file. This allows you to work with a file in UTF-8 in your source, even when the file is stored in a more esoteric format. If your default external encoding is UTF-8 (which it is if you're on Mac OS X), all of your file I/O is going to be in UTF-8 as well. You can check this using File.open('file').external_encoding
. What you're doing when you opening your file and passing "r:UTF-8"
is forcing the same external encoding that Ruby is using by default.
Chances are, your source document isn't in UTF-8 and those non-ascii characters aren't mapping cleanly to UTF-8 (if they were, you would either get the correct characters and no error, and if they mapped by incorrectly, you would get incorrect characters and no error). What you should do is try to determine the encoding of the source document, then have Ruby transcode the document on read, like so:
File.open(file, "r:windows-1251:utf-8").each_line { |line| puts line.strip(",") }
If you need help determining the encoding of the source, give this Python library a whirl. It's based on the automatic charset detection fallback that was in Seamonkey/Mozilla (and is possibly still in Firefox).
If you want to change your file encoding, you can use gem 'charlock holmes'
$require 'charlock_holmes/string'
content = File.read('test2.txt')
if !content.is_utf8?
detection = CharlockHolmes::EncodingDetector.detect(content)
utf8_encoded_content = CharlockHolmes::Converter.convert content, detection[:encoding], 'UTF-8'
end
Then you can save your new content in a temp file and overwrite your original file.
Hope this help.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With