Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I copy a file with a UTF-8 filename to another UTF-8 filename in Perl on Windows?

For example, given an empty file テスト.txt, how would I make a copy called テスト.txt.copy?

My first crack at it managed to access the file and create the new filename, but the copy generated テスト.txt.copy.

Here was my first crack at it:

#!/usr/bin/env perl

use strict;
use warnings;

use English '-no_match_vars';
use File::Basename;
use Getopt::Long;

use File::Copy;
use Win32;

my (
    $output_relfilepath,
   ) = process_command_line();

open my $fh, '>', $output_relfilepath or die $!;
binmode $fh, ':utf8';
foreach my $short_basename ( glob( '*.txt') ) {

  # skip the output basename if it's in the glob
  if ( $short_basename eq $output_relfilepath ) {
    next;
  }

  my $long_basename = Win32::GetLongPathName( $short_basename );
  my $new_basename  = $long_basename . '.copy';

  print {$fh} sprintf(
                      "short_basename = (%s)\n" .
                      " long_basename = (%s)\n" .
                      "  new_basename = (%s)\n",
                      $short_basename,
                      $long_basename,
                      $new_basename,
                     );
  copy( $short_basename, $new_basename );
}

printf(
       "\n%s done! (%d seconds elapsed)\n",
       basename( $0 ),
       time() - $BASETIME,
      );

# === subroutines ===

sub process_command_line {

  # default arguments
  my %args
    = (
       output_relfilepath => 'output.txt',
      );

  GetOptions(
             'help'                 => sub { print usage(); exit },
             'output_relfilepath=s' => \$args{output_relfilepath},
            );

  return (
          $args{output_relfilepath},
         );
}

sub usage {
  my $script_name = basename $0;

  my $usage = <<END_USAGE;
======================================================================

Test script to copy files with a UTF-8 filenames to files with
different UTF-8 filenames.  This example tries to make copies of all
.txt files with versions that end in .txt.copy.

  usage: ${script_name} (<options>)

options:

  -output_relfilepath <s>   set the output relative file path to <s>.
                            this file contains the short, long, and
                            new basenames.
                            (default: 'output.txt')

----------------------------------------------------------------------

examples:

  ${script_name}

======================================================================
END_USAGE

  return $usage;
}

Here are the contents of output.txt after execution:

short_basename = (BD9A~1.TXT)
 long_basename = (テスト.txt)
  new_basename = (テスト.txt.copy)

I've tried replacing File::Copy's copy command with a system call:

my $cmd = "copy \"${short_basename}\" \"${new_basename}\"";
print `$cmd`;

and with Win32::CopyFile:

Win32::CopyFile( $short_basename, $new_basename, 'true' );

Unfortunately, I get the same result in both cases (テスト.txt.copy). For the system call, the print shows 1 file(s) copied. as expected.

Notes:

  • I'm running Perl 5.10.0 via Strawberry Perl on Windows 7 Professional
  • I use the Win32 module to access long filenames
  • The glob returns short filenames, which I have to use to access the file
  • テスト = test (tesuto) in katakana
  • I've read perlunitut and The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
like image 543
vlee Avatar asked Feb 21 '10 00:02

vlee


3 Answers

This should be possible with the CopyFileW function from Win32API::File, which should be included with Strawberry. I've never messed with Unicode filenames myself, so I'm not sure of the details. You might need to use Encode to manually convert the filename to UTF-16LE (encode('UTF16-LE', $filename)).

like image 197
cjm Avatar answered Dec 02 '22 18:12

cjm


You're getting the long filename using Win32, which gives you a UTF-8-encoded string.

However, you're then setting the long filename using plain copy, which uses the C stdlib IO functions. The stdlib functions use the default filesystem encoding.

On modern Linuxes that's usually UTF-8, but on Windows it (sadly) never is, because the system default code page cannot be set to UTF-8. So you'll get your UTF-8 string interpreted as a code page 1252 string on a Western European Windows install, as has happened here. (On a Japanese machine it'd get interpreted as code page 932 — like Shift-JIS — which would come out something like 繝�せ繝�.)

I've not done this in Perl, but I'd suspect the Win32::CopyFile function would be more likely to be able to handle the kind of Unicode paths returned elsewhere in the Win32 module.

like image 40
bobince Avatar answered Dec 02 '22 18:12

bobince


Use Encode::Locale:

use Encode::Locale;
use Encode;
use File::Copy;

copy( encode(locale_fs => $short_basename),
      encode(locale_fs => $new_basename) ) || die $!;
like image 33
godegisel Avatar answered Dec 02 '22 20:12

godegisel