 

Ruby and encoding conversion

I'm importing a CSV file into Ruby (1.8.7). File.open('path/to/file.csv').read returns this in the console:

Stefan,Engstr\232m

The encoding is identified as iso-8859-2 by UniversalDetector (chardet gem).

UniversalDetector::chardet("Stefan,Engstr\232m")
=> {"confidence"=>0.626936305574385, "encoding"=>"ISO-8859-2"} 

Trying to convert the string yields the following:

Iconv.conv("UTF-8", "ISO-8859-2", "Stefan,Engstr\232m")
 => "Stefan,Engstrm"

whereas I would expect:

 => "Stefan,Engström"
  • Could the string really be in some other encoding?
  • I haven't seen the \232 syntax before. Usually when a string is strangely encoded, some weird character shows up instead, e.g. � or some Chinese characters.

Let me know if I should provide more information or elaborate on something.

sandstrom asked Apr 30 '26 16:04



1 Answer

The encoding is probably "Macintosh Roman"; a couple of other options would be "Mac Central European" and "Mac Icelandic". The \nnn notation is octal, so \232 is 154 in decimal (0x9A in hex), and in all three of those encodings character 154 is the lowercase o-umlaut ("ö") that you're expecting. I don't see "ö" at position 154 in any of the Windows code pages or ISO 8859 character sets. I'd guess that Mac Roman is more common than the Icelandic or Central European encodings.
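If you want to check the octal arithmetic yourself, here is a minimal sketch that pulls the suspicious byte out of the string and prints it in a few bases. The variable names are only for illustration; the string is the one from the question.

```ruby
# "\232" is an octal escape, so the string contains one raw non-ASCII byte.
bytes = "Stefan,Engstr\232m".unpack("C*")

# Pick out the first byte outside the 7-bit ASCII range.
suspect = bytes.find { |b| b > 127 }

puts suspect           # decimal
puts suspect.to_s(8)   # octal, as shown in the console output
puts suspect.to_s(16)  # hex, handy when reading code page tables
```

The hex form is usually the one to look up in a character set table, since most tables are laid out by hex row and column.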

Try using 'MacRoman' as your source encoding with Iconv:

>> Iconv.conv("UTF-8", "MacRoman", "Stefan,Engstr\232m")
=> "Stefan,Engström"
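For anyone reading this on a newer Ruby: Iconv is no longer part of the standard library from Ruby 2.0 on, so the same conversion would probably go through String#encode instead. This is a rough sketch assuming Ruby 1.9 or later and the built-in "macRoman" encoding name; it is not what the question's 1.8.7 setup can run.

```ruby
# Assumption: Ruby 1.9+, where String#encode and the "macRoman" encoding exist.
raw = "Stefan,Engstr\232m"

# Treat the bytes as Mac Roman and transcode them to UTF-8.
# The second argument names the source encoding, so raw itself is not modified.
utf8 = raw.encode("UTF-8", "macRoman")

puts utf8
```

If the guess about the source encoding is wrong, the call still succeeds for most single-byte encodings but gives you the wrong characters, so it is worth eyeballing a few known names in the output.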
mu is too short answered May 03 '26 06:05



