I'm migrating some data from MS Access 2003 to MySQL 5.0 using Ruby 1.8.6 on Windows XP (writing a Rake task to do this).
Turns out the Windows string data is encoded as windows-1252 and Rails and MySQL are both assuming utf-8 input so some of the characters, such as apostrophes, are getting mangled. They wind up as "a"s with an accent over them and stuff like that.
Does anyone know of a tool, library, system, methodology, ritual, spell, or incantation to convert a windows-1252 string to utf-8?
Just open up the windows-1252 encoded file in Notepad, then choose 'Save as' and set encoding to UTF-8.
Windows-1252 is a subset of UTF-8 in terms of 'what characters are available', but not in terms of their byte-by-byte representation. Windows-1252 has characters between bytes 127 and 255 that UTF-8 has a different encoding for. Any visible character in the ASCII range (127 and below) are encoded 1:1 in UTF-8.
In order to convert a String into UTF-8, we use the getBytes() method in Java. The getBytes() method encodes a String into a sequence of bytes and returns a byte array. where charsetName is the specific charset by which the String is encoded into an array of bytes.
Windows-1252 or CP-1252 (code page 1252) is a single-byte character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and many European languages including Spanish, French, and German.
For Ruby 1.8.6, it appears you can use Ruby Iconv, part of the standard library:
Iconv documentation
According this helpful article, it appears you can at least purge unwanted win-1252 characters from your string like so:
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
One might then attempt to do a full conversion like so:
ic = Iconv.new('UTF-8', 'WINDOWS-1252')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With