
Using UTF-8 everywhere in/with a Perl script

I'm running the latest Perl under German Windows 7, and I want to use UTF-8 everywhere in my Perl programs (for the script itself, file contents, file names, mail texts, etc.).

Everything works fine until I try to process files that have special characters in their names. Even system calls don't work properly. So (how) can I tell Perl to use UTF-8 everywhere?

I experimented with encode and decode for a while, but it's very unclear to me why what works, works. I also need encode('cp850', TEXT) to get correct display in the command prompt window.

Examples:

When I need to copy a file, it only works when I use File::Copy::copy(encode("iso-8859-1", $filename), ...), and when I want to work with PDF file contents, the command that succeeds is system(encode('cp850', sprintf('pdftk.exe %s...', decode('utf8', $file))));

Why is that (especially the decode in the system call), and is there an easier way? Maybe something with use open ':encoding(...)', but I've had no luck so far.
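For reference, the workarounds described above, collected into a minimal sketch. It assumes the strings in the script are Perl character strings (decoded text); the pdftk.exe call, the file names, and the Kopie- prefix are made up for illustration:

use strict;
use warnings;
use utf8;                       # the source file itself is UTF-8 encoded
use Encode qw(encode);
use File::Copy qw(copy);

my $filename = "Übung.pdf";     # a character string inside Perl

# File functions want bytes in the ANSI codepage (iso-8859-1 / cp1252 here):
copy(encode('iso-8859-1', $filename),
     encode('iso-8859-1', "Kopie-$filename"))
    or warn "copy failed: $!";

# The console uses the OEM codepage (cp850 on German Windows):
print encode('cp850', "Verarbeite $filename\n");

# system() also wants OEM-codepage bytes:
system(encode('cp850', qq{pdftk.exe "$filename" dump_data}));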

asked Sep 11 '18 by toshniba

2 Answers

Here's the real, concrete, and definitive answer from someone who recently went through this exact problem:

You cannot, on Windows, have Perl 5.28.0 or below use UTF-8 for everything.

Here's why: as of Perl 5.28.0, the Perl core file-handling functions are fatally broken for this. Windows stores file names as (simply put) UTF-16, and the Windows API wide-character functions return file names as wide chars, similar to what Perl already operates with internally. However, when reading these from the file system, the Perl core converts them into bytes in the encoding of the local system, and vice versa when writing file names. So, morally, you have this kind of flow, paraphrased as Perl:

use utf8;
use Encode qw(encode decode);

# CP_ACP is a stand-in for the system ANSI codepage
# (e.g. cp1252 on a German Windows).

sub readdir_perl {
    my $dir = shift;
    # the core gets a wide-char (UTF-16) name from the Windows API ...
    my $fn = readdir $dir;
    # ... and force-converts it to bytes in the ANSI codepage
    $fn = encode CP_ACP, $fn;
    return $fn;
}

sub open_perl {
    my $fn = shift;
    # the ANSI-codepage bytes are converted back to wide chars ...
    $fn = decode CP_ACP, $fn;
    # ... and handed to the Windows API
    open my $FH, '<', $fn;
    return $FH;
}

Two important notes:

  • All of the stuff above is paraphrased. It's roughly how the Perl core implements these functions in C, and you cannot usefully change them, nor CP_ACP, for the duration of a program.
  • The conversion from wide chars to CP_ACP is forced through; it doesn't bail on errors. Wide chars that cannot be usefully represented are converted to a ? character, leaving you with a handful of garbage (see the sketch below).
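An illustration of that forced conversion, using Encode to stand in for what the core does in C (the file name is made up; cp1252 is the ANSI codepage on a German system):

use strict;
use warnings;
use Encode qw(encode);

# Suppose a file named "Ω-report.txt" exists; Ω has no cp1252 representation.
my $wide_name = "\x{03A9}-report.txt";         # what the wide-char API returns
my $bytes     = encode('cp1252', $wide_name);  # Encode's default: substitute '?'
print $bytes, "\n";                            # prints "?-report.txt"

# Any later open("?-report.txt") cannot reach the real file, so the
# entry readdir gave you is effectively lost.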

That said, what can you do?

  1. Use Win32::LongPath. It handles most of what you need internally, for files. Be aware that it only works reliably on volumes with short paths enabled, which is usually C: and nothing else. Use system as normal, but ensure you treat everything as bytes and decode/encode appropriately. Some example code exists. You'll also need to implement ALL file handling manually, and you can't usefully monkey-patch other code to use the LongPath functions. (A sketch follows after this list.)
  2. Wait until the Perl core is fixed. As far as I know there are currently no plans to do this anytime soon, since any kind of simple fix is likely to break legacy scripts that rely on the UTF-16-to-system-codepage conversion to usefully munge Unicode umlauts into äöü on German systems, etc.
  3. Use a different language. Maybe PowerShell.
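A minimal sketch of option 1, based on Win32::LongPath's documented object and openL interfaces; the directory and file names are made up, and error handling is reduced to die:

use strict;
use warnings;
use utf8;
use Win32::LongPath;

# Win32::LongPath talks to the wide-char API directly, so directory
# entries come back as real Perl character strings, with no '?' damage.
my $dir = Win32::LongPath->new;
$dir->opendirL('C:/Daten') or die "opendirL: $^E";
for my $name ($dir->readdirL) {
    next if $name eq '.' || $name eq '..';
    # openL wants a reference to the handle, then mode and path:
    openL(\my $fh, '<:raw', "C:/Daten/$name") or die "openL: $^E";
    # ... process $fh as usual ...
    close $fh;
}
$dir->closedirL;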
answered by Mithaldu


First, set the codepage of your command prompt to 65001:

chcp 65001

This will allow you to use and display UTF-8 characters in the command prompt. File names are dependent on the file system being used; NTFS stores file names using the UTF-16LE encoding. See this question on how to create and access files with Unicode file names on Windows.
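For Perl's own output to match the 65001 codepage, give the standard output streams a UTF-8 layer as well; a minimal sketch:

use strict;
use warnings;
use utf8;

# After `chcp 65001` the console expects UTF-8 bytes:
binmode STDOUT, ':encoding(UTF-8)';
binmode STDERR, ':encoding(UTF-8)';

print "Grüße aus München\n";   # displays correctly now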

system() commands need to be encoded in the same codepage as the command prompt, so after doing chcp 65001 you can encode the system() command in UTF-8.
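A minimal sketch of that, reusing the asker's pdftk.exe example (the file name is made up):

use strict;
use warnings;
use utf8;
use Encode qw(encode);

my $file = "Bericht-März.pdf";   # a character string inside the script

# With the console at codepage 65001, encode the command line as UTF-8
# bytes before handing it to the shell:
system(encode('UTF-8', qq{pdftk.exe "$file" dump_data}));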

answered by JGNI