
Question about pathname encoding

What have I done to get such a strange encoding in this path-name?
In my file manager (Dolphin) the path-name looks good.

#!/usr/local/bin/perl
use warnings;
use 5.014;
use utf8;
use open qw( :encoding(UTF-8) :std );
use File::Find;
use Devel::Peek;
use Encode qw(decode);

my $string;
find( sub { $string = $File::Find::name }, 'Delibes, Léo' );
$string =~ s|Delibes,\ ||;
$string =~ s|\..*\z||;
my ( $s1, $s2 ) = split m|/|, $string, 2;

say Dump $s1;
say Dump $s2;

# SV = PV(0x824b50) at 0x9346d8
#   REFCNT = 1
#   FLAGS = (PADMY,POK,pPOK,UTF8)
#   PV = 0x93da30 "L\303\251o"\0 [UTF8 "L\x{e9}o"]
#   CUR = 4
#   LEN = 16

# SV = PV(0x7a7150) at 0x934c30
#   REFCNT = 1
#   FLAGS = (PADMY,POK,pPOK,UTF8)
#   PV = 0x7781e0 "Lakm\303\203\302\251"\0 [UTF8 "Lakm\x{c3}\x{a9}"]
#   CUR = 8
#   LEN = 16

say $s1;
say $s2;

# Léo
# Lakmé

$s1 = decode( 'utf-8', $s1 );
$s2 = decode( 'utf-8', $s2 );

say $s1;
say $s2;

# L�o
# Lakmé
sid_com asked Dec 22 '22 08:12



1 Answer

Unfortunately, your operating system's pathname API is another "binary interface": you have to use Encode::encode and Encode::decode to get predictable results.

Most operating systems treat pathnames as a sequence of octets (i.e. bytes). Whether that sequence should be interpreted as Latin-1, UTF-8 or some other character encoding is an application decision. Consequently, the value returned by readdir() is simply a sequence of octets, and File::Find doesn't know that you want the path name as Unicode code points. It forms $File::Find::name by simply concatenating the directory path (which you supplied as a character string) with the octets returned by your OS via readdir(). Perl treats each of those octets as a Latin-1 character when it joins them to a character string, and that's how you got code points mashed with octets.

Rule of thumb: Whenever you pass a path name to the OS, Encode::encode() it first so that it is a sequence of octets. Whenever you get a path name back from the OS, Encode::decode() it from the file system's encoding into the character string your application wants to work with.

You can make your program work by calling find this way:

find( sub { ... }, Encode::encode('utf8', 'Delibes, Léo') );

And then calling Encode::decode() when using the value of $File::Find::name:

my $path = Encode::decode('utf8', $File::Find::name);
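
Putting the two calls together, a minimal self-contained sketch might look like the following. It assumes a file system that stores names as UTF-8 octets (as on a typical Linux setup); the temporary directory, the file name Lakmé.txt and the @names array are made up for illustration and are not part of the original question.

```perl
use strict;
use warnings;
use utf8;                          # string literals in this file are characters
use Encode qw(encode decode);
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical test tree: <tmp>/Léo/Lakmé.txt, created with octet path names.
my $root = tempdir( CLEANUP => 1 );
my $dir  = $root . '/' . encode( 'UTF-8', 'Léo' );       # octets for the OS
mkdir $dir or die "mkdir: $!";
my $file = $dir . '/' . encode( 'UTF-8', 'Lakmé.txt' );  # octets for the OS
open my $fh, '>', $file or die "open: $!";
close $fh;

my @names;
find(
    sub {
        # The start directory was octets, so $File::Find::name is octets too.
        # Decode it once, here, to get a character string.
        push @names, decode( 'UTF-8', $File::Find::name );
    },
    $dir,
);

binmode STDOUT, ':encoding(UTF-8)';   # encode characters again on output
print "$_\n" for sort @names;
```

The point of the design is that every string is either all octets (on the OS side) or all characters (inside the program), and the conversion happens exactly once at each boundary.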

To make this clearer, this is how $File::Find::name was formed:

use Encode;
use feature 'say';   # needed for say() when this snippet is run on its own

# Appending a code point above 255 and then chopping it off forces Perl
# to store $dir internally as UTF-8 (i.e. with the UTF8 flag set)

my $dir = 'L' .chr(233).'o'.chr(256);
chop $dir;

say "dir: ", d($dir); # length = 3

# This is what readdir() is returning:

my $leaf = encode('utf8', 'Lakm' . chr(233));

say "leaf: ", d($leaf); # length = 6

$File::Find::name = $dir . '/' . $leaf;

say "File::Find::name: ", d($File::Find::name);

sub d {
  join(' ', map { sprintf("%02X", ord($_)) } split('', $_[0]))
}
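
For comparison, here is a small sketch of the same concatenation with and without decoding the leaf first. The helper name hex_points and the variables $mixed and $fixed are illustrative only; the helper does the same job as d() above.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Render each character of a string as its hex code point
# (hypothetical helper, equivalent to d() above).
sub hex_points {
    return join ' ', map { sprintf '%02X', ord($_) } split //, $_[0];
}

my $dir  = 'L' . chr(233) . 'o';                    # characters: L, é, o
my $leaf = encode( 'UTF-8', 'Lakm' . chr(233) );    # octets, as readdir() returns them

my $mixed = $dir . '/' . $leaf;                     # characters joined with octets
my $fixed = $dir . '/' . decode( 'UTF-8', $leaf );  # characters joined with characters

print 'mixed: ', hex_points($mixed), "\n";   # 4C E9 6F 2F 4C 61 6B 6D C3 A9
print 'fixed: ', hex_points($fixed), "\n";   # 4C E9 6F 2F 4C 61 6B 6D E9
```

In the mixed string the é of the leaf shows up as the two code points C3 and A9, which is the "Ã©" seen in the question; after decoding it is the single code point E9.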
ErikR answered Dec 31 '22 15:12
