
Same code, different results on different machines regarding UTF8 characters

Tags:

utf-8

perl

I have this code:

use strict;
use warnings;
use utf8;
use HTML::Entities;
use feature 'say';

binmode STDOUT, ':encoding(utf-8)';

my $t1 = "Česká Spořitelna - Q3 2014";
my $t2 =  "Česká Spořitelna - Q3 2014";

say decode_entities($t1);
say decode_entities($t2);

which, when executed on my dev machine, outputs:

Česká Spořitelna - Q3 2014
Česká Spořitelna - Q3 2014

and when executed on the UAT machine (User Acceptance Test), outputs:

Äeská SpoÅitelna - Q3 2014
Äeská SpoÅitelna - Q3 2014

On both machines, perl -v reports: This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi-ld

and the version of HTML::Entities is the same on both machines:

    Installed: 3.69
    CPAN:      3.69  up to date

My dev machine runs CentOS release 5.8 (Final) and the UAT machine runs Red Hat Enterprise Linux Server release 5.8 (Tikanga)

EDIT: The output of the locale command is the same on both machines:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

UPDATE:

I posted a link to this question in the Perl developers group on Facebook and got some really useful ideas there: compare the output bytes on the two systems. If they are identical, it is a display issue. And they are. There is more than one way to do it:

1)

say join ':', map { ord } split //, decode_entities($t1);
say join ':', map { ord } split //, decode_entities($t2);

which displays 268:101:115:107:225:32:83:112:111:345:105:116:101:108:110:97:32:45:32:81:51:32:50:48:49:52 on both systems, so the decoded code points are the same

2) Print the output for $t1 and $t2 to a file on each system, then run hexdump -C on those files and compare the results. This method also showed that the contents of the files are the same.
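The byte-level comparison in method 2 can also be done from inside Perl. The following is a minimal sketch, not part of the original post: the helper name utf8_hex is hypothetical, and it assumes the output layer is UTF-8, as set by the binmode call above. It encodes a character string with the core Encode module and returns the octets in hex, so the result can be compared across machines without depending on how the terminal renders them.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 octets of a character string
# as space-separated two-digit hex values.
sub utf8_hex {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);        # characters -> bytes
    return join ' ', unpack '(H2)*', $octets;   # one hex pair per byte
}

# "\x{010C}esk\x{E1}" is the word "Česká", written with escapes so the
# example does not depend on the encoding of the source file.
print utf8_hex("\x{010C}esk\x{E1}"), "\n";      # prints: c4 8c 65 73 6b c3 a1
```

If both machines print the same hex string, the data leaving Perl is identical and any difference is introduced later, for example by the terminal.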

Conclusion

It is a display problem: the console (PuTTY) does not display the characters properly. We see this problem when we add those characters to the DB, and I thought I had isolated it with the above code. Your answers (and some from Facebook) helped me find out that decode_entities() works as expected and that our problem lies somewhere else (most probably in the MySQL table charset or the MySQL connection).

Tudor Constantin asked Nov 20 '25 21:11


1 Answer

The two command terminals expect different encodings. If you want to print UTF-8, you must set both terminals to expect UTF-8, for instance (for Romanian) with

LANG=ro_RO.UTF-8

as well as setting STDOUT to encode the output that way in your Perl with, for example

binmode STDOUT, ':encoding(utf-8)'

Update

I can explain what is happening, although quite why it's that way I'm not sure.

Take the first character of the string: "\x{010C}" which is a capital C caron. That is being encoded by Perl as the two-octet code "\x{C4}\x{8C}" and sent to the terminal, which, on your development machine, is decoding it and displaying it correctly.

However, on your test machine the terminal is decoding the first octet of the encoded character - C4 - as if it were ISO-8859-1, where it is a capital A umlaut. The second octet - 8C - is not displayed because it does not map to a printable character in that encoding.
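The misreading described above can be reproduced without a terminal. This is a minimal sketch under the assumption that the test machine's terminal interprets the octets as ISO-8859-1; the helper name misread_as_latin1 is hypothetical. It encodes the character to UTF-8, then decodes those same octets as ISO-8859-1 and prints the resulting code points.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: simulate a terminal that reads UTF-8 output
# as if it were ISO-8859-1 (an assumption about the test machine).
sub misread_as_latin1 {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);     # what Perl sends to STDOUT
    return decode('ISO-8859-1', $octets);    # how such a terminal reads it
}

# U+010C (capital C caron) becomes the two octets C4 8C, which are
# then read as two separate ISO-8859-1 characters.
my $seen = misread_as_latin1("\x{010C}");
printf "U+%04X\n", ord($_) for split //, $seen;
# prints:
# U+00C4
# U+008C
```

U+00C4 is the capital A umlaut seen in the broken output, and U+008C has no visible glyph, which matches the single character shown in place of the C caron.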

So you need to change the code page your terminal is using. The way to do that is by setting LANG as I described, but I can't explain why it isn't working if your locale is set up correctly.

Borodin answered Nov 22 '25 10:11


