I have this code:
use strict;
use warnings;
use utf8;
use HTML::Entities;
use feature 'say';
binmode STDOUT, ':encoding(utf-8)';
my $t1 = "Česká Spořitelna - Q3 2014";
my $t2 = "Česká Spořitelna - Q3 2014";
say decode_entities($t1);
say decode_entities($t2);
which, when executed on my dev machine, outputs:
Česká Spořitelna - Q3 2014
Česká Spořitelna - Q3 2014
and when executed on the UAT machine (User Acceptance Test), outputs:
Äeská SpoÅitelna - Q3 2014
Äeská SpoÅitelna - Q3 2014
On both machines, perl -v reports: This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi-ld
and the version of HTML::Entities is the same on both machines:
Installed: 3.69
CPAN: 3.69 up to date
My dev machine runs CentOS release 5.8 (Final) and the UAT machine runs Red Hat Enterprise Linux Server release 5.8 (Tikanga)
EDIT (regarding the output of the locale command)
The output of it is the same on both machines:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
UPDATE:
I posted a link to this question in the Perl developers group on Facebook and got a really useful idea from there: compare the output bytes on the two systems. If they are identical, it is a display issue. And they are identical. There is more than one way to do it:
1)
say join ':', map { ord } split //, decode_entities($t1);
say join ':', map { ord } split //, decode_entities($t2);
which displays 268:101:115:107:225:32:83:112:111:345:105:116:101:108:110:97:32:45:32:81:51:32:50:48:49:52 on both systems, so the decoded characters (code points) are the same
2) Print the output for $t1 and $t2 to a file on each system, then run hexdump -C on those files and compare the results. This method also showed that the content of the files is the same
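The byte-level comparison can also be done from Perl itself, without a temporary file. This is only a minimal sketch: the helper name utf8_hex is made up for illustration, and it uses the core Encode module with escaped code points so that it does not depend on how the source file is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 octets of a character string
# as space-separated, two-digit hex values.
sub utf8_hex {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $octets;
}

# "\x{010C}esk\x{E1}" is the first word of the test string.
print utf8_hex("\x{010C}esk\x{E1}"), "\n";   # c4 8c 65 73 6b c3 a1
```

If this prints the same hex string on both machines, the program is producing the same bytes and the difference is in how they are displayed.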
Conclusion
It is a display problem: the console (PuTTY) does not display the characters properly.
We have this problem when we add those characters to the DB, and I thought I had managed to isolate it with the above code. Your answers (and some from Facebook) helped me find out that decode_entities() works as expected and that our problem lies somewhere else (most probably in the MySQL table charset or the MySQL connection).
The two command terminals expect different encodings. If you want to print UTF-8, you must set both terminals to expect UTF-8 with, for instance for Romanian
LANG=ro_RO.UTF-8
as well as setting STDOUT to encode the output that way in your Perl with, for example
binmode STDOUT, ':encoding(utf-8)'
Update
I can explain what is happening, although I'm not sure why it is that way.
Take the first character of the string: "\x{010C}" which is a capital C caron. That is being encoded by Perl as the two-octet code "\x{C4}\x{8C}" and sent to the terminal, which, on your development machine, is decoding it and displaying it correctly.
However, on your test machine the terminal is decoding the first octet of the encoded character, C4, as if it were ISO-8859-1, where it is a capital A umlaut. The second octet, 8C, is not displayed because in that encoding it is a non-printing control code rather than a visible character.
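The misreading described above can be reproduced in a few lines. This is a sketch under the assumption that the terminal really is treating the octets as ISO-8859-1; the function name misread_as_latin1 is hypothetical, and only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: encode a character string as UTF-8, then decode
# those octets as ISO-8859-1 (the assumed terminal encoding), and return
# the resulting code points formatted as U+XXXX.
sub misread_as_latin1 {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);          # U+010C becomes C4 8C
    my $wrong  = decode('ISO-8859-1', $octets);   # each octet becomes one character
    return map { sprintf('U+%04X', ord($_)) } split //, $wrong;
}

print join(' ', misread_as_latin1("\x{010C}")), "\n";   # U+00C4 U+008C
```

U+00C4 is the capital A umlaut seen in the bad output, and U+008C is a control code with no visible glyph, so one correctly encoded character ends up shown as a single wrong one.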
So you need to change the code page your terminal is using. The way to do that is by setting LANG as I described, but I can't explain why it isn't working if your locale is set up correctly.