I'm importing content from an outside database that is infected with a variety of odd characters, e.g.
> str
=> "Natureâ€™s Variety, Best Friends Animal Society team up"
From context it seems that â€™ represents a right single quote (’). In cp1252 encoding:
> str.encode('cp1252')
=> "Nature\xE2\x80\x99s Variety, Best Friends Animal Society team up"
So how do I convert it to the correct UTF-8 character? Here's what I've tried:
> str.encode('UTF-8')
=> "Natureâ€™s Variety, Best Friends Animal Society team up"
> str.encode('cp1252').encode('UTF-8')
=> "Natureâ€™s Variety, Best Friends Animal Society team up"
> str.encode('UTF-8', invalid: :replace, replace: '?', undef: :replace)
=> "Natureâ€™s Variety, Best Friends Animal Society team up"
> str.encode('cp1252').encode('UTF-8', invalid: :replace, replace: '?', undef: :replace)
=> "Natureâ€™s Variety, Best Friends Animal Society team up"
I'd rather find a generic re-encoding that handles all such mis-encoded characters, but if I have to, I'll do individual search-and-replace. I can't make that work either:
> str.encode('cp1252').gsub('\xE2/x80/x99', "'")
=> "Nature\xE2\x80\x99s Variety, Best Friends Animal Society team up"
> str.encode('cp1252').gsub(%r{\xE2\x80\x99}, "'")
SyntaxError: unexpected tIDENTIFIER, expecting $end
> str.encode('cp1252').gsub(Regexp.escape('\xE2\x80\x99'), "'")
=> "Nature\xE2\x80\x99s Variety, Best Friends Animal Society team up"
I'd like to do this, but I can't even paste these characters into my REPL:
> str.gsub('â€™', "'")
When I try, I get:
> str.gsub('C"b,b,b
* "', ",")
=> "Natureâ€™s Variety, Best Friends Animal Society team up"
Frustrating. Any suggestions on how to encode this properly into UTF-8?
Edit: As requested, here are the actual bytes in the string:
> str.bytes.to_a.join(' ')
=> "78 97 116 117 114 101 195 162 226 130 172 226 132 162 115 32 86 97 114 105 101 116 121 44 32 66 101 115 116 32 70 114 105 101 110 100 115 32 65 110 105 109 97 108 32 83 111 99 105 101 116 121 32 116 101 97 109 32 117 112"
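Reading the dump: the eight bytes after "Nature" (195 162 226 130 172 226 132 162) are the UTF-8 encodings of three separate characters, and those three characters map to the single CP1252 bytes E2 80 99, which is how a right single quote is encoded in UTF-8. A minimal sketch that checks this, using only the first 15 bytes of the dump (the variable names are just for illustration):

```ruby
# First 15 bytes of the dump: "Nature", the eight suspicious bytes, then "s".
bytes = [78, 97, 116, 117, 114, 101, 195, 162, 226, 130, 172, 226, 132, 162, 115]

# Interpret the raw bytes as UTF-8, which is how the string is stored right now.
mojibake = bytes.pack('C*').force_encoding('UTF-8')

# Code points of the three odd characters that follow "Nature".
odd = mojibake.chars[6, 3].map { |c| format('U+%04X', c.ord) }

# The same three characters as single CP1252 bytes.
cp1252_bytes = mojibake.encode('CP1252').bytes[6, 3]

# UTF-8 bytes of U+2019, the right single quotation mark.
quote_bytes = "\u2019".bytes

puts odd.inspect
puts cp1252_bytes.inspect
puts quote_bytes.inspect
```

If the last two lines print the same three numbers, the text was UTF-8 that got decoded as CP1252 somewhere upstream and then stored as UTF-8 again.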
I had this problem in "Fixing Incorrect String Encoding From MySQL". The string is UTF-8 that was decoded as CP1252, so you need to encode it back to CP1252 to recover the original bytes and then force the encoding to UTF-8.
# Five code points have no CP1252 mapping; pass them through as raw bytes.
fallback = {
  "\u0081" => "\x81".force_encoding("CP1252"),
  "\u008D" => "\x8D".force_encoding("CP1252"),
  "\u008F" => "\x8F".force_encoding("CP1252"),
  "\u0090" => "\x90".force_encoding("CP1252"),
  "\u009D" => "\x9D".force_encoding("CP1252")
}
# Recover the original bytes, then relabel them as UTF-8.
str.encode('CP1252', fallback: fallback).force_encoding('UTF-8')
The fallback may not be necessary depending on your data, but it prevents an Encoding::UndefinedConversionError by handling the five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) that are undefined in CP1252.
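If the import mixes damaged and healthy rows, it may help to wrap the round trip in a guard. The sketch below is only one way to do it: `fix_mojibake` and `CP1252_UNDEFINED` are hypothetical names, and the rule "keep the original unless the result is valid UTF-8" is my assumption, not something the data guarantees. It needs Ruby 2.6 or later for `to_h` with a block.

```ruby
# Hypothetical fallback table: the five code points that CP1252 leaves
# undefined, each mapped to the raw byte with the same value.
CP1252_UNDEFINED = [0x81, 0x8D, 0x8F, 0x90, 0x9D].to_h do |byte|
  [byte.chr('UTF-8'), [byte].pack('C').force_encoding('CP1252')]
end.freeze

# Hypothetical helper: undo one round of "UTF-8 bytes read as CP1252".
# Returns the input unchanged when the round trip is not possible or
# does not produce valid UTF-8.
def fix_mojibake(str)
  repaired = str.encode('CP1252', fallback: CP1252_UNDEFINED)
                .force_encoding('UTF-8')
  repaired.valid_encoding? ? repaired : str
rescue EncodingError
  # The string contains a character that CP1252 cannot represent,
  # so it was not produced by this kind of mis-decoding.
  str
end

damaged = "Nature\u00E2\u20AC\u2122s Variety"
puts fix_mojibake(damaged)
puts fix_mojibake("caf\u00E9")
```

A correctly encoded string can still pass the guard by coincidence (for example, text that really does contain "Ã©"), so it is worth spot-checking the output before writing it back.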