
Ruby/Rails CSV parsing, invalid byte sequence in UTF-8

I am trying to parse a CSV file generated from an Excel spreadsheet.

Here is my code:

require 'csv'
file = File.open("input_file")
csv = CSV.parse(file)

But I get this error:

ArgumentError: invalid byte sequence in UTF-8 

I think the error occurs because Excel saved the file as ISO-8859-1 (Latin-1) rather than UTF-8.

Can someone help me with a workaround for this issue, please?

Thanks in advance.

asked Dec 05 '11 01:12 by rogeliog


People also ask

Why does Ruby on rails error UTF-8 invalid byte sequence?

Ruby, however, doesn't know that the file was originally encoded in ISO-8859-1 and by default interprets it as UTF-8. As a result, many string operations on its contents fail with the infamous "invalid byte sequence in UTF-8" error. The invalid sequence here is the byte C5: it means "Å" in ISO-8859-1, but on its own it is not valid UTF-8, where "Å" is encoded as the two bytes C3 85.
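To make this concrete, here is a minimal sketch that reproduces the error without any file. The sample bytes and variable names are made up for illustration; the snippet only assumes that the Latin-1 bytes end up tagged as UTF-8, which is what happens when a file is read with the default settings on most systems.

```ruby
# Bytes as they would arrive from an ISO-8859-1 file: 0xC5 is "Å", 0xF6 is "ö".
latin1_bytes = "\xC5ngstr\xF6m".b

# Tagging those bytes as UTF-8 mimics what Ruby does by default.
misread = latin1_bytes.dup.force_encoding(Encoding::UTF_8)
puts misread.valid_encoding?   # false

# A regex match forces Ruby to interpret the bytes, which is where it fails.
error_message =
  begin
    misread =~ /ngstr/
    "no error"
  rescue ArgumentError => e
    e.message
  end
puts error_message
```

CSV parsing has to interpret the characters in much the same way, which is why it hits the same ArgumentError.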

What is the default string encoding in Ruby?

Ruby's default source encoding since 2.0 is UTF-8. This means that Ruby will treat string literals as UTF-8 encoded, and on most systems data read from files as well, unless you tell it explicitly that the data is encoded differently. The Å character is a convenient example of this problem.
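As a rough illustration of "telling Ruby explicitly", the sketch below tags a single Latin-1 byte with its real encoding and then transcodes it to UTF-8. The variable names are only for this example.

```ruby
# One raw byte, 0xC5, as it would appear in an ISO-8859-1 file.
latin1_bytes = "\xC5".b

# force_encoding only changes the label on the string; the bytes stay the same.
as_latin1 = latin1_bytes.dup.force_encoding(Encoding::ISO_8859_1)

# encode actually converts the bytes to their UTF-8 representation.
as_utf8 = as_latin1.encode(Encoding::UTF_8)

puts as_latin1.bytes.inspect   # [197]       (0xC5)
puts as_utf8.bytes.inspect     # [195, 133]  (0xC3 0x85)
puts as_utf8.valid_encoding?   # true
```

The distinction matters: force_encoding is for fixing a wrong label, encode is for converting between encodings once the label is right.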

How many bytes is a UTF-8 character?

Every character in UTF-8 is a sequence of 1 to 4 bytes. Apart from UTF-8 there are other encodings such as ISO-8859-1 or Windows-1252; you may have seen these names before in your programming career. These are single-byte encodings that cover Western European text, including accented Latin characters.
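A quick way to see the variable width is to compare byte sizes. This is just a sketch with four arbitrarily chosen sample characters, written as \u escapes so the snippet does not depend on how the source file itself is saved.

```ruby
# "A", "Å", "€" and an emoji, written as Unicode escapes.
samples = ["A", "\u00C5", "\u20AC", "\u{1F600}"]

# Number of bytes each character occupies in UTF-8.
utf8_sizes = samples.map(&:bytesize)
p utf8_sizes   # [1, 2, 3, 4]

# The same "Å" needs only one byte in ISO-8859-1.
latin1_size = "\u00C5".encode(Encoding::ISO_8859_1).bytesize
p latin1_size  # 1
```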


1 Answer

You need to tell Ruby that the file is in ISO-8859-1. Change your file open line to this:

file = File.open("input_file", "r:ISO-8859-1")

The second argument tells Ruby to open the file read-only and to treat its contents as ISO-8859-1.
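For anyone who wants to try this end to end, here is a small self-contained sketch. The temp file and its contents are made up for the example. It also uses the "r:ISO-8859-1:UTF-8" form of the mode string, which is a slight variation on the line above: it reads the file as ISO-8859-1 and transcodes to UTF-8, so the parsed fields come out as ordinary UTF-8 strings.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical sample data: a small CSV written as raw ISO-8859-1 bytes
# (0xF6 is "ö", 0xC5 is "Å").
sample = Tempfile.new(['latin1', '.csv'])
sample.binmode
sample.write("name,city\nJ\xF6rg,\xC5rhus\n".b)
sample.close

# External encoding ISO-8859-1, internal encoding UTF-8.
rows = File.open(sample.path, "r:ISO-8859-1:UTF-8") do |file|
  CSV.parse(file.read)
end

p rows
sample.unlink
```

If you don't need the File object yourself, CSV.read(path, encoding: "ISO-8859-1:UTF-8") should give the same rows in one call.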

answered Sep 27 '22 22:09 by Linuxios