
When importing CSV data, how do we eliminate "invalid byte sequence in UTF-8"?

Tags: ruby, utf-8

We allow users to import data via CSV (using Ruby 1.9.2, hence it's FasterCSV).

Being user data, of course, it might not be properly sanitized.

When we try to display the data in an /index method, we sometimes get the error "invalid byte sequence in UTF-8", pointing to the ERB template where we display one of the fields, widget.name.

When we do the import, we'd like to FORCE the incoming data to be valid. Is there a Ruby operator that will map a string to a valid UTF-8 string, e.g., something like

goodstring = badstring.no_more_invalid_bytes 

One example of 'bad' data is a character that looks like a hyphen but is not a regular ASCII hyphen. We'd prefer to map the non-UTF-8 characters to a reasonable ASCII equivalent (u-umlaut going to u, for example), BUT we're okay with simply stripping the character too.

Since this happens when importing lots of data, it needs to be a fast built-in operator, hopefully...


Note: here is an example of the data. The file comes from Windows and is 8-bit ASCII. When we import it and display widget.name.inspect (instead of widget.name) in our ERB, we get: "Chains \x96 Accessories"

So one example of the data is a "hyphen" that's actually the 8-bit byte 0x96 (an en dash in Windows-1252).

--- When we changed our CSV parsing to assign fldval = d.encode('UTF-8'), it threw this error:

Encoding::UndefinedConversionError in StoresController#importfinderitems "\x96" from ASCII-8BIT to UTF-8 

What we're looking for is a simple way to just force it to be valid UTF-8 regardless of origin type, even if we simply strip non-ASCII.
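The error above arises because the string is tagged ASCII-8BIT (binary), and binary has no defined mapping to UTF-8 for bytes above 0x7F. A minimal sketch of one way around it, assuming the file really is Windows-1252: tag the bytes with their actual encoding first, then transcode. The helper name to_utf8 and its default source encoding are hypothetical, not something from the import code described here.

```ruby
# Hypothetical helper: the name and the Windows-1252 default are assumptions.
def to_utf8(raw, source_encoding = 'Windows-1252')
  raw.to_s.dup                           # work on a copy, never the caller's string
     .force_encoding(source_encoding)    # relabel the bytes; no conversion happens yet
     .encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
end

bad  = "Chains \x96 Accessories".b       # binary string, like the one in the question
good = to_utf8(bad)
puts good.valid_encoding?
```

With this approach byte 0x96 becomes a real en dash (U+2013) instead of being dropped; bytes that the source encoding does not define are removed by the replace: '' option.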


While not as 'nice' as forcing the encoding, this works at a slight expense to our import time: d.to_s.strip.gsub(/\P{ASCII}/, ''). Thank you, Mladen!
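For comparison, here is a sketch of the same "strip everything non-ASCII" idea done with String#encode alone, treating the input as raw bytes so the original encoding does not matter. The helper name strip_non_ascii is hypothetical, and whether this is any faster than the gsub version for a given import would need measuring.

```ruby
# Hypothetical helper: drops every byte outside the ASCII range.
def strip_non_ascii(value)
  value.to_s
       .encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
       .strip
end

puts strip_non_ascii("Chains \x96 Accessories".b).inspect
```

Because the source is declared as binary, every byte above 0x7F counts as undefined and is replaced with the empty string, so multi-byte UTF-8 characters such as an umlaut are removed as well.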

jpw asked Feb 19 '11 20:02



2 Answers

Ruby 1.9's CSV has a new parser that works with m17n (multilingualization). The parser works in the Encoding of the IO or String object being read. The methods ::foreach, ::open, ::read, and ::readlines take an optional :encoding option with which you can specify the Encoding.

For example:

CSV.read('/path/to/file', :encoding => 'windows-1251:utf-8') 

This would convert all strings to UTF-8.

You can also use the more standard encoding name 'ISO-8859-1':

CSV.read('/..', {:headers => true, :col_sep => ';', :encoding => 'ISO-8859-1'}) 
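
As a rough end-to-end illustration of the :encoding option, the sketch below writes a small Windows-1252 file and reads it back as UTF-8. The helper name read_windows_csv, the temp file, and the choice of Windows-1252 are assumptions made for the example. The 'external:internal' form asks Ruby to transcode while reading; a single encoding name generally only declares the file's encoding, unless Encoding.default_internal is set.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: reads a Windows-1252 file and returns UTF-8 strings.
def read_windows_csv(path)
  CSV.read(path, encoding: 'windows-1252:utf-8')
end

Tempfile.create(['widgets', '.csv']) do |f|
  f.binmode
  f.write("name\nChains \x96 Accessories\n".b)  # byte 0x96 as in the question
  f.flush
  rows = read_windows_csv(f.path)
  puts rows.last.first.valid_encoding?
end
```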
Trung Lê answered Oct 19 '22 13:10


# scrub replaces invalid byte sequences (with U+FFFD by default)
CSV.parse(File.read('/path/to/csv').scrub)
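
A small sketch of what scrub does to a single field, assuming the string is tagged UTF-8 but contains a stray byte. Note that String#scrub was added in Ruby 2.1, so it is not available on the 1.9.2 mentioned in the question. The helper name scrub_field and the '-' replacement are hypothetical choices for the example.

```ruby
# Hypothetical helper: replaces invalid bytes with a chosen string.
def scrub_field(value, replacement = '')
  value.to_s.scrub(replacement)
end

raw = "Chains \x96 Accessories".b.force_encoding('UTF-8')  # invalid UTF-8
puts raw.valid_encoding?
puts scrub_field(raw, '-')
```

Unlike transcoding from the real source encoding, scrub cannot recover the intended character; it only guarantees that the result is valid.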
Bill Lipa answered Oct 19 '22 13:10