we allow users to import data via csv (using ruby 1.9.2, hence it's fastercsv).
being user data, of course, it might not be properly sanitized.
When we try to display the data in an /index method we sometimes get the error "invalid byte sequence in UTF-8" pointing to our erb where we display one of the fields widget.name
When we do the import we'd like to FORCE the incoming data to be valid... is there a ruby operator that will map a string to a valid utf8 string, eg, something like
goodstring = badstring.no_more_invalid_bytes
One example of 'bad' data is char that looks like a hyphen but is not a regular ascii hyphen. We'd prefer to map the non-utf-8 chars to a reasonable ascii equivalent (umlat-u going to u for exmaple) BUT we're okay with simply stripping the character to.
since this is when importing lots of data, it needs to be a fast built-in operator, hopefully...
Note: here is an example of the data. The file comes form windows and is 8bit ascii. when we import it and in our erb we display widget.name.inspect (instead of widget.name) we get: "Chains \x96 Accessories"
so one example of the data is a "hyphen" that's actually 8 bit code 96.
--- when we changed our csv parse to assign fldval = d.encode('UTF-8') it throws this error:
Encoding::UndefinedConversionError in StoresController#importfinderitems "\x96" from ASCII-8BIT to UTF-8
what we're looking for is a simple way to just force it to be valid utf8 regardless of origin type, even if we simply strip non-ascii.
while not as 'nice' as forcing the encoding, this works at a slight expense to our import time: d.to_s.strip.gsub(/\P{ASCII}/, '') Thank you, Mladen!
Google Spreadsheet correctly exports UTF-8 encoded CSV files by default. From the File menu, choose Download As and select Comma-separated values. The downloaded file will be UTF-8 encoded.
This error is created when the uploaded file is not in a UTF-8 format. UTF-8 is the dominant character encoding format on the World Wide Web. This error occurs because the software you are using saves the file in a different type of encoding, such as ISO-8859, instead of UTF-8.
Ruby 1.9 CSV has new parser that works with m17n. The parser works with Encoding of IO object in the string. Following methods: ::foreach, ::open, ::read, and ::readlines
could take in optional options :encoding
which you could specify the the Encoding.
For example:
CSV.read('/path/to/file', :encoding => 'windows-1251:utf-8')
Would convert all strings to UTF-8.
Also you can use the more standard encoding name 'ISO-8859-1'
CSV.read('/..', {:headers => true, :col_sep => ';', :encoding => 'ISO-8859-1'})
CSV.parse(File.read('/path/to/csv').scrub)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With