
How to read a file with a long Unicode file name in Strawberry Perl without using Win32::Unicode::File?

I have a file on a Windows XP machine whose path contains Danish characters. I use Strawberry Perl and would like to read this file. The following code works fine:

    use Win32::Unicode::File;
    # Some code left out....
    $fname = $mw -> getOpenFile(-filetypes=>$types);
    my $fh = Win32::Unicode::File->new;
    $fh->open('<', $fname);

The getOpenFile routine comes from Tk. For some reason Win32::Unicode::File has unfortunate side effects that I cannot live with (it eats my memory, see '"Out of memory" with simple Win32::Unicode::File readline loop and Strawberry Perl'). If I try to open the file without the Win32::Unicode::File interface, I get a file-not-found error, because the path gets interpreted incorrectly. I have tried converting the path as described in "Perl: managing path encodings on Windows", but that doesn't work for some reason. How should I solve this? I have tried the following:

    use Encode;
    # Some code left out....
    $fname = $mw -> getOpenFile(-filetypes=>$types);
    my $fh;
    open($fh, '<', encode("utf8",$fname,Encode::FB_CROAK));

and it does not work. Any ideas?

Please forgive me if I am unclear.

Kind regards, Michael

Dr. Mike asked Jan 05 '12 12:01


1 Answer

encode("utf8"

Perl uses the standard C library I/O functions to open files. On Windows, filenames are natively Unicode (UTF-16 behind the scenes), so the library has to interpret the bytes you pass through that byte-oriented interface as being in a particular encoding.

Here's the problem: the encoding picked is never UTF-8, or any other UTF. It's the locale-specific default encoding, known (misleadingly) as the ANSI code page. On a Western European Windows install that's code page 1252 (cp1252). In general you can find out what it is by calling Win32::Codepage::get_encoding.
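To see why passing UTF-8 bytes gives file-not-found, here is a minimal sketch that only uses the core Encode module and does not touch the filesystem. The helper name `name_as_seen_by_crt` and the sample name "rød.txt" are made up for illustration, and cp1252 is assumed as the ANSI code page:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: return the file name Windows would actually look for
# if the script passed $bytes_encoding bytes while the ANSI code page is
# $ansi_codepage (both are Encode encoding names).
sub name_as_seen_by_crt {
    my ($name, $bytes_encoding, $ansi_codepage) = @_;
    my $bytes = encode($bytes_encoding, $name);  # the bytes handed to open()
    return decode($ansi_codepage, $bytes);       # how those bytes get interpreted
}

my $wanted = "r\x{f8}d.txt";                     # "rød.txt", written with an escape
my $seen   = name_as_seen_by_crt($wanted, "UTF-8", "cp1252");

binmode(STDOUT, ":encoding(UTF-8)");
printf "wanted %d characters, lookup uses %d: %s\n",
    length($wanted), length($seen), $seen;
```

The single character ø becomes two UTF-8 bytes, and each of those is read as a separate cp1252 character, so the lookup is for a different name than the one on disk.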

So by converting your string to that encoding, you can access the file using the native file support, as long as all the characters in the file's path are in the ANSI code page. For Danish on a Western machine that's OK; for Danish on a Chinese machine, or vice versa, you will always get a file-not-found error.
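A rough sketch of that conversion, assuming the code page name is passed in as an Encode name such as "cp1252" (on a real Windows machine it could come from Win32::Codepage::get_encoding instead). The helpers `ansi_path` and `open_ansi` are hypothetical names, not part of any module:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a character-string path into bytes in the given
# ANSI code page. FB_CROAK makes encode() die when a character has no
# mapping in that code page; LEAVE_SRC keeps encode() from modifying $path.
sub ansi_path {
    my ($path, $codepage) = @_;
    return encode($codepage, $path, Encode::FB_CROAK | Encode::LEAVE_SRC);
}

# Hypothetical helper: open a file for reading via the converted path.
sub open_ansi {
    my ($path, $codepage) = @_;
    my $bytes = ansi_path($path, $codepage);
    open(my $fh, '<', $bytes) or die "Cannot open '$path': $!";
    return $fh;
}

# Danish characters exist in cp1252, so the conversion succeeds.
my $danish = ansi_path("bl\x{e5}b\x{e6}r.txt", "cp1252");    # "blåbær.txt"

# A Chinese character does not, so the conversion dies and eval returns undef.
my $chinese_ok = eval { ansi_path("\x{4e2d}.txt", "cp1252"); 1 };
print "Chinese name is not representable in cp1252\n" unless $chinese_ok;
```

In the question's code this would be used as `my $fh = open_ansi($fname, "cp1252");` in place of the `encode("utf8", ...)` call. Dying on unmappable characters is a deliberate choice here: it reports the real cause instead of a misleading file-not-found.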

So if you want to support filenames containing arbitrary Unicode characters on Windows, you have no choice but to use the Win32 API instead, as Win32::Unicode::File does. This isn't unique to Perl; other languages without explicit support for Unicode filenames have exactly the same problem.

bobince answered Sep 22 '22 18:09