I'm running the latest Perl under German Windows 7 and I want to use UTF-8 everywhere in my Perl programs (for the script, the file contents, file names, mail texts, etc.).
Everything works fine, but I'm facing problems when trying to process files that have special characters in their names. Even system calls do not work well. So (how) can I tell Perl to use UTF-8 everywhere?
I experimented for a while with encode and decode, but it's very unclear why it works the way it does... Also, I need encode('cp850', TEXT) for correct display in the command prompt window.
Examples:
When I need to copy a file, it only works when I use File::Copy::copy(encode("iso-8859-1", $filename), ...), and when I want to work with PDF file contents the successful command is system(encode('cp850', sprintf('pdftk.exe %s...', decode('utf8', $file))));
Why is that (especially the decode in the system call), and is there an easier way? Maybe something with use open ':encoding...', but I had no luck so far.
Here's the real, concrete and definite answer from someone who just recently went through this exact problem:
You cannot, on Windows, have Perl 5.28.0 or below use UTF-8 for everything.
Here is why: as of Perl 5.28.0, the Perl core file handling functions are fatally fucked for this. Windows stores file names as (simply put) UTF-16, and the Windows API wide-character functions return file names as wide characters, similar to what Perl already operates with internally. However, when getting these from the file system, the Perl core converts them into bytes in the encoding of the local system, and does the reverse when writing file names. So, morally, you have this kind of flow, paraphrased as Perl:
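use utf8;
use Encode qw(encode decode);
use constant CP_ACP => 'cp1252';   # stand-in for the system's ANSI code page
# Paraphrase only: this mimics what the core does, it is not the core's code.
sub readdir_perl {
    my $dir = shift;
    my $fn = readdir $dir;         # imagine Windows hands back wide characters here
    $fn = encode(CP_ACP, $fn);     # the core flattens them to bytes in the local encoding
    return $fn;
}
sub open_perl {
    my $fn = shift;
    $fn = decode(CP_ACP, $fn);     # local-encoding bytes are turned back into wide characters
    open my $FH, '<', $fn;         # imagine this reaches the wide-character API
    return $FH;
}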
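To see what that conversion does to a name without touching the file system, here is a minimal sketch using only the Encode module. It assumes cp1252 as the local encoding (typical for a German Windows, but an assumption here), and round_trip is a hypothetical helper, not anything the Perl core exposes.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumption: cp1252 stands in for the ANSI code page of the local system.
my $ACP = 'cp1252';

# Simulate the trip a file name takes through the core:
# characters -> bytes in the local encoding -> characters again.
sub round_trip {
    my ($name) = @_;
    my $bytes = encode($ACP, $name);   # unmappable characters are substituted
    return decode($ACP, $bytes);
}

my %samples = (
    german   => "Gr\x{fc}\x{df}e.txt",    # Grüße.txt, representable in cp1252
    japanese => "\x{65e5}\x{672c}.txt",   # 日本.txt, not representable in cp1252
);

for my $label (sort keys %samples) {
    my $name = $samples{$label};
    my $same = round_trip($name) eq $name ? 'survives' : 'is damaged by';
    print "$label name $same the round trip\n";
}
```

Names made only of characters from the local code page come back intact, which is why German umlauts often appear to work while anything outside the code page does not.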
Two important notes:
[...] characters that cannot be represented in the local encoding are replaced with a ? character, leaving you with a handful of garbage.
That said, what can you do?
[...] use system as normal, but ensure you treat everything as bytes and decode/encode appropriately. Some example code exists. You'll also need to implement ALL file handling manually, and you can't usefully monkeypatch other code to use the LongPath functions.
First set the code page of your command prompt to 65001:
chcp 65001
This will allow you to use and display UTF-8 characters in the command prompt. File names, however, depend on the file system being used: NTFS stores file names using the UTF-16LE encoding. See this question on how to create and access files with Unicode file names on Windows.
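On the Perl side, the matching step is to make sure the script actually emits UTF-8 bytes. A minimal sketch, assuming the console has already been switched with chcp 65001; use_utf8_output is a hypothetical helper name, and the sample text is only illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper: attach a UTF-8 encoding layer to an output handle so
# that character strings are written as UTF-8 bytes.
sub use_utf8_output {
    my ($fh) = @_;
    binmode($fh, ':encoding(UTF-8)') or die "binmode failed: $!";
    return $fh;
}

use_utf8_output(\*STDOUT);

# Character string with umlauts, sharp s and a euro sign (Grüße, 5 €).
print "Gr\x{fc}\x{df}e, 5 \x{20ac}\n";
```

Without the encoding layer, Perl would warn about wide characters or write bytes in a different encoding than the console expects.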
system() commands need to be encoded in the same code page as the command prompt, so after running chcp 65001 you can encode the system() command as UTF-8.
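One way to keep this manageable is to build the command from character strings and encode it exactly once, right before it is handed to system(). The sketch below assumes that approach; command_bytes and mytool.exe are hypothetical names used for illustration, and the actual system() call is left commented out.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: fill a sprintf-style template with quoted arguments
# (all character strings) and encode the result for the given code page.
sub command_bytes {
    my ($codepage, $format, @args) = @_;
    my $command = sprintf($format, map { qq{"$_"} } @args);
    return encode($codepage, $command);
}

# A decoded file name, i.e. characters rather than bytes (Prüfung.pdf).
my $file = "Pr\x{fc}fung.pdf";

# Same command, encoded for two different console code pages.
my $for_cp850 = command_bytes('cp850', 'mytool.exe %s', $file);
my $for_utf8  = command_bytes('UTF-8', 'mytool.exe %s', $file);

printf "cp850: %d bytes, UTF-8: %d bytes\n",
    length $for_cp850, length $for_utf8;

# After chcp 65001 the call would then be, under the assumption above:
# system($for_utf8);
```

Keeping the name as characters until this point avoids the mix of encode and decode calls from the question, because there is only one place where the code page matters.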