When importing CSV data, how do we eliminate "invalid byte sequence in UTF-8"?

Tags:

utf-8

We allow users to import data via CSV (using Ruby 1.9.2, hence it's FasterCSV).

Being user data, of course, it might not be properly sanitized.

When we try to display the data in an /index method, we sometimes get the error "invalid byte sequence in UTF-8", pointing to the ERB template where we display one of the fields, widget.name.

When we do the import, we'd like to FORCE the incoming data to be valid. Is there a Ruby operator that will map a string to a valid UTF-8 string, e.g., something like

goodstring = badstring.no_more_invalid_bytes

One example of 'bad' data is a char that looks like a hyphen but is not a regular ASCII hyphen. We'd prefer to map the non-UTF-8 chars to a reasonable ASCII equivalent (u-umlaut going to u, for example), BUT we're okay with simply stripping the character too.

Since this happens when importing lots of data, it needs to be a fast built-in operator, hopefully...

Note: here is an example of the data. The file comes from Windows and is 8-bit ASCII. When we import it and display widget.name.inspect (instead of widget.name) in our ERB, we get: "Chains \x96 Accessories"

So one example of the data is a "hyphen" that's actually the 8-bit code 0x96.

When we changed our CSV parse to assign fldval = d.encode('UTF-8'), it throws this error:

Encoding::UndefinedConversionError in StoresController#importfinderitems "\x96" from ASCII-8BIT to UTF-8

What we're looking for is a simple way to just force it to be valid UTF-8 regardless of origin type, even if we simply strip non-ASCII.

While not as 'nice' as forcing the encoding, this works at a slight expense to our import time: d.to_s.strip.gsub(/\P{ASCII}/, ''). Thank you, Mladen!
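For reference, the "force it to be valid" conversion asked about above might be sketched roughly as follows. This is only an illustration: the sample string is the one from the question, the variable names are made up, and the second option assumes the file really is Windows-1252, which the question does not confirm.

```ruby
# Raw bytes as they might come out of the CSV parser (tagged as binary / ASCII-8BIT).
raw = "Chains \x96 Accessories".dup.force_encoding(Encoding::BINARY)

# Option 1: drop every byte that has no UTF-8 equivalent.
# The replacement options keep encode from raising UndefinedConversionError.
stripped = raw.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')

# Option 2 (assumption: the source is Windows-1252): relabel the bytes with
# their real encoding first, then transcode. 0x96 becomes an en dash (U+2013).
converted = raw.dup.force_encoding('Windows-1252').encode('UTF-8')

puts stripped.inspect
puts converted.inspect
```

Option 1 loses the dash entirely, while option 2 keeps it as a real Unicode character, so option 2 is probably preferable whenever the source encoding is known.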


asked Feb 19 '11 20:02

jpw

2 Answers

Ruby 1.9's CSV has a new parser that works with m17n. The parser works in the Encoding of the IO or String object being read. The methods ::foreach, ::open, ::read, and ::readlines accept an optional :encoding option with which you can specify the Encoding.

For example:

CSV.read('/path/to/file', :encoding => 'windows-1251:utf-8')

This would convert all strings to UTF-8.

You can also use the more standard encoding name 'ISO-8859-1':
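A small self-contained check of the :encoding option could look like the sketch below. It writes the question's sample value to a temporary file and reads it back. The file contents and the choice of 'windows-1252' as the source encoding are assumptions for illustration, not something the answer specifies.

```ruby
require 'csv'
require 'tempfile'

# Write a tiny CSV containing the raw 0x96 byte, the way a Windows export might.
file = Tempfile.new(['widgets', '.csv'])
file.binmode
file.write("name\nChains \x96 Accessories\n")
file.close

# "external:internal" means: read the file as Windows-1252, hand back UTF-8.
rows = CSV.read(file.path, encoding: 'windows-1252:utf-8')
name = rows[1][0]

puts name.encoding        # encoding of the parsed field
puts name.valid_encoding? # whether the field is valid in that encoding

file.unlink
```

Because the transcoding happens while the file is read, every field comes back already valid, and no per-field cleanup is needed during the import.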

CSV.read('/..', :headers => true, :col_sep => ';', :encoding => 'ISO-8859-1')


answered Oct 19 '22 13:10

Trung Lê

CSV.parse(File.read('/path/to/csv').scrub)
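To show what scrub does here, a minimal sketch follows, using an in-memory string in place of File.read (the sample data is taken from the question; the variable names are illustrative). String#scrub requires Ruby 2.1 or later, so it would not be available on the 1.9.2 mentioned in the question.

```ruby
require 'csv'

# Stand-in for File.read: a UTF-8-tagged string containing an invalid byte.
raw = "name\nChains \x96 Accessories\n".dup.force_encoding(Encoding::UTF_8)
puts raw.valid_encoding?

# scrub with no argument swaps each invalid byte for U+FFFD (the replacement character).
marked = raw.scrub

# Passing a replacement string lets you strip the bad bytes instead.
clean = raw.scrub('')

rows = CSV.parse(clean)
puts rows.inspect
```

scrub only repairs validity. It does not recover the original character, so the dash is either shown as a replacement character or lost.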

answered Oct 19 '22 13:10

Bill Lipa
