Why does encoding, then decoding strings make Arabic characters lose their context?

Tags: unicode, perl, arabic

I'm (belatedly) testing Unicode waters for the first time and am failing to understand why the process of encoding, then decoding an Arabic string is having the effect of separating out the individual characters that the word is made of.

In the example below, the word "ﻟﻠﺒﻴﻊ" consists of 5 individual letters: "ع","ي","ب","ل","ل", written right to left. Depending on the surrounding context (adjacent letters), the letters change form.

use strict;
use warnings;
use utf8;

binmode( STDOUT, ':utf8' );

use Encode qw< encode decode >;

my $str = 'ﻟﻠﺒﻴﻊ';                 # "For sale" 
my $enc = encode( 'UTF-8', $str );
my $dec = decode( 'UTF-8', $enc );

my $decoded = pack 'U0W*', map +ord, split //, $enc;

print "Original string : $str\n";     #  ل ل ب ي ع   
print "Decoded string 1: $dec\n";     #  ل ل ب ي ع
print "Decoded string 2: $decoded\n"; #  ل ل ب ي ع

ADDITIONAL INFO

When pasting the string to this post, the rendering is reversed so it looks like "ﻊﻴﺒﻠﻟ". I'm reversing it manually to get it to look 'right'. The correct hexdump is given below:
```
$ echo "ﻟﻠﺒﻴﻊ" | hexdump
0000000 bbef ef8a b4bb baef ef92 a0bb bbef 0a9f
0000010
```

The output of the Perl script (per ikegami's request):

$ perl unicode.pl | od -t x1
0000000 4f 72 69 67 69 6e 61 6c 20 73 74 72 69 6e 67 20
0000020 3a 20 d8 b9 d9 8a d8 a8 d9 84 d9 84 0a 44 65 63
0000040 6f 64 65 64 20 73 74 72 69 6e 67 20 31 3a 20 d8
0000060 b9 d9 8a d8 a8 d9 84 d9 84 0a 44 65 63 6f 64 65
0000100 64 20 73 74 72 69 6e 67 20 32 3a 20 d8 b9 d9 8a
0000120 d8 a8 d9 84 d9 84 0a
0000127

And if I just print $str:

$ perl unicode.pl | od -t x1
0000000 4f 72 69 67 69 6e 61 6c 20 73 74 72 69 6e 67 20
0000020 3a 20 d8 b9 d9 8a d8 a8 d9 84 d9 84 0a
0000035

Finally (per ikegami's request):

$ grep 'For sale' unicode.pl | od -t x1
0000000 6d 79 20 24 73 74 72 20 3d 20 27 d8 b9 d9 8a d8
0000020 a8 d9 84 d9 84 27 3b 20 20 23 20 22 46 6f 72 20
0000040 73 61 6c 65 22 20 0a
0000047

Perl details

$ perl -v

This is perl, v5.10.1 (*) built for x86_64-linux-gnu-thread-multi
(with 53 registered patches, see perl -V for more detail)

Outputting to file reverses the string: "ﻊﻴﺒﻠﻟ"

QUESTIONS

I have several:

How can I preserve the context of each character while printing?
Why is the original string printed out to screen as individual letters, even though it hasn't been 'processed'?
When printing to file, the word is reversed (I'm guessing this is due to the script's right-to-left nature). Is there a way I can prevent this from happening?
Why does the following not hold true: $str !~ /\P{Bidi_Class: Right_To_Left}/;

asked Jan 30 '13 20:01

Zaid

2 Answers

Source code returned by StackOverflow (as fetched using wget):

... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a ...

U+FEDF ARABIC LETTER LAM INITIAL FORM
U+FEE0 ARABIC LETTER LAM MEDIAL FORM
U+FE92 ARABIC LETTER BEH MEDIAL FORM
U+FEF4 ARABIC LETTER YEH MEDIAL FORM
U+FECA ARABIC LETTER AIN FINAL FORM

perl output I get from the source code returned by StackOverflow:

... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a 0a
... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a 0a
... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a 0a

U+FEDF ARABIC LETTER LAM INITIAL FORM
U+FEE0 ARABIC LETTER LAM MEDIAL FORM
U+FE92 ARABIC LETTER BEH MEDIAL FORM
U+FEF4 ARABIC LETTER YEH MEDIAL FORM
U+FECA ARABIC LETTER AIN FINAL FORM
U+000A LINE FEED

So I get exactly what's in the source, as I should.

perl output you got:

... d8 b9 d9 8a d8 a8 d9 84 d9 84 0a
... d8 b9 d9 8a d8 a8 d9 84 d9 84 0a
... d8 b9 d9 8a d8 a8 d9 84 d9 84 0a

U+0639 ARABIC LETTER AIN
U+064A ARABIC LETTER YEH
U+0628 ARABIC LETTER BEH
U+0644 ARABIC LETTER LAM
U+0644 ARABIC LETTER LAM
U+000A LINE FEED

OK, so you could have a buggy Perl (one that reverses and changes Arabic characters, and only those), but it's far more likely that your source doesn't contain what you think it does. You need to check which bytes make up your source.
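
A minimal sketch of one way to run that check from Perl itself, assuming the script is saved as unicode.pl as in the question. The helper names hex_bytes and dump_matching_lines are made up for this illustration. The file is opened with the :raw layer so that nothing is decoded before the bytes are printed.

```perl
use strict;
use warnings;

# Return the bytes of a raw (undecoded) string as space-separated hex pairs.
sub hex_bytes {
    my ($raw) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $raw;
}

# Hex-dump every line of a file that contains the given ASCII marker.
# The file is read as raw bytes, so the dump shows exactly what is on disk.
sub dump_matching_lines {
    my ( $path, $marker ) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my @dumps;
    while ( my $line = <$fh> ) {
        push @dumps, hex_bytes($line) if index( $line, $marker ) >= 0;
    }
    close $fh;
    return @dumps;
}

# Only run when called with a file name and a marker on the command line.
if ( @ARGV == 2 ) {
    print "$_\n" for dump_matching_lines(@ARGV);
}
```

Saved under a hypothetical name such as dump_bytes.pl, it could be run as `perl dump_bytes.pl unicode.pl 'For sale'`, which should show the same bytes as piping grep into od.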

echo output you got:

ef bb 8a ef bb b4 ef ba 92 ef bb a0 ef bb 9f 0a

U+FECA ARABIC LETTER AIN FINAL FORM
U+FEF4 ARABIC LETTER YEH MEDIAL FORM
U+FE92 ARABIC LETTER BEH MEDIAL FORM
U+FEE0 ARABIC LETTER LAM MEDIAL FORM
U+FEDF ARABIC LETTER LAM INITIAL FORM
U+000A LINE FEED

There are significant differences in what you got from perl and from echo, so it's no surprise they show up differently.
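
One relationship between the two dumps can be checked directly: the characters in the echo output are Arabic presentation forms, and Unicode compatibility normalization (NFKC) maps each of them to its base letter. The sketch below applies NFKC from the core Unicode::Normalize module to the five code points listed for the echo output. Whether any tool between the editor and the file actually performed such a mapping is an assumption; the dumps alone don't prove it.

```perl
use strict;
use warnings;
use Unicode::Normalize qw< NFKC >;

# The five presentation-form code points listed for the echo output, in order.
my $forms = join '', map { chr $_ } 0xFECA, 0xFEF4, 0xFE92, 0xFEE0, 0xFEDF;

# NFKC replaces each presentation form with its compatibility equivalent,
# which for these characters is the plain base letter.
my $base = NFKC($forms);

# Show the resulting code points in U+XXXX notation.
my $hex = join ' ', map { sprintf 'U+%04X', ord $_ } split //, $base;
print "$hex\n";    # U+0639 U+064A U+0628 U+0644 U+0644
```

The result is the same sequence of base letters (AIN, YEH, BEH, LAM, LAM) listed for the perl output above.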

Output inspected using:

$ perl -Mcharnames=:full -MEncode=decode_utf8 -E'
   say sprintf("U+%04X %s", $_, charnames::viacode($_))
      for unpack "C*", decode_utf8 pack "H*", $ARGV[0] =~ s/\s//gr;
' '...'

(Don't forget to swap the bytes of hexdump.)
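
As a sketch of that swap: default hexdump output groups the data into 16-bit words, and on a little-endian machine the two bytes of each word are shown in reverse order. The helper name swap_hexdump_words is made up for this illustration, and it assumes only the data words are passed in, without the offset column.

```perl
use strict;
use warnings;

# Turn default hexdump words (two bytes each, printed high byte first)
# back into the byte order in which they appear in the file.
sub swap_hexdump_words {
    my ($words) = @_;
    my @bytes;
    for my $word ( split ' ', $words ) {
        if ( $word =~ /\A([0-9a-f]{2})([0-9a-f]{2})\z/i ) {
            push @bytes, $2, $1;    # low byte comes first in the file
        }
        else {
            push @bytes, $word;     # leave anything unexpected untouched
        }
    }
    return join ' ', @bytes;
}

# Data words from the hexdump in the question.
my $swapped = swap_hexdump_words('bbef ef8a b4bb baef ef92 a0bb bbef 0a9f');
print "$swapped\n";    # ef bb 8a ef bb b4 ef ba 92 ef bb a0 ef bb 9f 0a
```

Using `hexdump -C` or `od -t x1` avoids the swap altogether, since both print one byte at a time.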

answered Oct 11 '22 11:10

ikegami

Maybe something is odd with your shell? If I redirect the output to a file, the result is the same. Please try this out:

use strict;
use warnings;
use utf8;

binmode( STDOUT, ':utf8' );

use Encode qw< encode decode >;

my $str = 'ﻟﻠﺒﻴﻊ';                 # "For sale" 
my $enc = encode( 'UTF-8', $str );
my $dec = decode( 'UTF-8', $enc );

my $decoded = pack 'U0W*', map +ord, split //, $enc;

open(F1,'>',"original.txt") or die;
open(F2,'>',"decoded.txt") or die;
open(F3,'>',"decoded2.txt") or die;

binmode(F1, ':utf8');binmode(F2, ':utf8');binmode(F3, ':utf8');

print F1 "$str\n";     #  ل ل ب ي ع   
print F2 "$dec\n";     #  ل ل ب ي ع
print F3 "$decoded\n";

answered Oct 11 '22 13:10

user1126070
