My Perl app and MySQL database now handle incoming UTF-8 data properly, but I have to convert the pre-existing data. Some of the data appears to have been encoded as CP-1252 and not decoded as such before being encoded as UTF-8 and stored in MySQL. I've read the O'Reilly article Turning MySQL data in latin1 to utf8 utf-8, but although it's frequently referenced, it's not a definitive solution.
I've looked at Encode::DoubleEncodedUTF8 and Encoding::FixLatin, but neither worked on my data.
This is what I've done so far:
#Return the $bytes from the DB using BINARY()
my $characters = decode('utf-8', $bytes);
my $good = decode('utf-8', encode('cp-1252', $characters));
That fixes most of the cases, but if run against proplerly-encoded records, it mangles them. I've tried using Encode::Guess and Encode::Detect, but they cannot distinguish between the properly encoded and the misencoded records. So I just undo the conversion if the \x{FFFD} character is found after the conversion.
Some records, though, are only partially converted. Here's an example where the left curly quotes are properly converted, but the right curly quotes get mangled.
perl -CO -MEncode -e 'print decode("utf-8", encode("cp-1252", decode("utf-8", "\xC3\xA2\xE2\x82\xAC\xC5\x93four score\xC3\xA2\xE2\x82\xAC\xC2\x9D")))'
And and here's an example where a right single quote did not convert:
perl -CO -MEncode -e 'print decode("utf-8", encode("cp-1252", decode("utf-8", "bob\xC3\xAF\xC2\xBF\xC2\xBDs")))'
Am I also dealing with double encoded data here? What more must I do to convert these records?
With the "four score" example, it almost certainly is doubly-encoded data. It looks like either:
(Naturally, both cases look identical)
Now, that's what you expected, so why didn't your code work?
First, I'd like to refer you to this table which shows the conversion from cp1252 to unicode. The important thing I want you to note is that there are some bytes (such as 0x9D) which are not valid in cp1252.
When I imagine writing a cp1252 to utf8 converter, therefore, I need to do something with those bytes that aren't in cp1252. The only sensible thing I can think of is to transform the unknown bytes into unicode characters at the same value. In fact, this appears to be what happened. Let's take your "four score" example back one step at a time.
First, since it is valid utf-8, let's decode with:
$ perl -CO -MEncode -e '$a=decode("utf-8",
"\xC3\xA2\xE2\x82\xAC\xC5\x93" .
"four score" .
"\xC3\xA2\xE2\x82\xAC\xC2\x9D");
for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
This yields this sequence of unicode code points:
e2 20ac 153 66 6f 75 72 20 73 63 6f 72 65 e2 20ac 9d
("fmt" is a unix command that just reformats text so that we have nice line breaks with long data)
Now, let's represent each of these as a byte in cp1252, but when the unicode character can't be represented in cp1252, let's just replace it with a byte that has the same numeric value. (Instead of the default, which is to replace it with a question mark) We should then, if we're correct about what happened to the data, have a valid utf8 byte stream.
$ perl -CO -MEncode -e '$a=decode("utf-8",
"\xC3\xA2\xE2\x82\xAC\xC5\x93" .
"four score" .
"\xC3\xA2\xE2\x82\xAC\xC2\x9D");
$a=encode("cp-1252", $a, sub { chr($_[0]) } );
for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
That third argument to encode - when it's a sub - tells what to do with unrepresentable characters.
This yields:
e2 80 9c 66 6f 75 72 20 73 63 6f 72 65 e2 80 9d
Now, this is a valid utf8 byte stream. Can't tell that by inspection? Well, let's ask perl to decode this byte stream as utf8:
$ perl -CO -MEncode -e '$a=decode("utf-8",
"\xC3\xA2\xE2\x82\xAC\xC5\x93" .
"four score" .
"\xC3\xA2\xE2\x82\xAC\xC2\x9D");
$a=encode("cp-1252", $a, sub { chr($_[0]) } );
$a=decode("utf-8", $a, 1);
for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
Passing "1" as the third argument to decode ensures that our code will croak if the byte stream is invalid. This yields:
201c 66 6f 75 72 20 73 63 6f 72 65 201d
Or printed:
$ perl -CO -MEncode -e '$a=decode("utf-8",
"\xC3\xA2\xE2\x82\xAC\xC5\x93" .
"four score" .
"\xC3\xA2\xE2\x82\xAC\xC2\x9D");
$a=encode("cp-1252", $a, sub { chr($_[0]) } );
$a=decode("utf-8", $a, 1);
print "$a\n"'
“four score”
So I think that the full algorithm should be this:
.
sub demangle {
my($a) = shift;
eval { # the non-string form of eval just traps exceptions
# so that we return undef on exception
local $SIG{__WARN__} = sub {}; # No warning messages
$a = decode("utf-8", $a, 1);
encode("cp-1252", $a, sub {$_[0] <= 255 or die $_[0]; chr($_[0])});
}
}
This is based on the assumption that it's actually very rare for a string that isn't all-ASCII to be a valid utf-8 byte stream unless it really is utf-8. That is, it's not the sort of thing that happens accidentally.
EDITED TO ADD:
Note that this technique does not help too much with your "bob's" example, unfortunately. I think that that string also went through two rounds of cp1252-to-utf8 conversion, but unfortunately there was also some corruption. Using the same technique as before, we first read the byte sequence as utf8 and look at the sequence of unicode character references we get:
$ perl -CO -MEncode -e '$a=decode("utf-8",
"bob\xC3\xAF\xC2\xBF\xC2\xBDs");
for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
This yields:
62 6f 62 ef bf bd 73
Now, it just so happens that for the three bytes ef bf bd, unicode and cp1252 agree. So representing this sequence of unicode code points in cp1252 is just:
62 6f 62 ef bf bd 73
That is, the same sequence of numbers. Now, this is in fact a valid utf-8 byte stream, but what it decodes to may surprise you:
$ perl -CO -MEncode -e '$a=decode("utf-8",
"bob\xC3\xAF\xC2\xBF\xC2\xBDs");
$a=encode("cp-1252", $a, sub { chr(shift) } );
$a=decode("utf-8", $a, 1);
for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
62 6f 62 fffd 73
That is, the utf-8 byte stream, though a legitimate utf-8 byte stream, encoded the character 0xFFFD, which is generally used for "untranslatable character". I suspect that what happened here is that the first *-to-utf8 transformation saw a character it didn't recognize and replaced it with "untranslatable". There's no way to then programmatically recover the original character.
A consequence is that you can't detect whether a stream of bytes is valid utf8 (needed for that algorithm I gave above) simply by doing a decode and then looking for 0xFFFD. Instead, you should use something like this:
sub is_valid_utf8 {
defined(eval { decode("utf-8", $_[0], 1) })
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With