Note below how ã changes to a.
NOTE2: Before you blame this on CMD.EXE and Windows pipe weirdness, see Experiment 2 below, which gets a similar problem using File::Find.
The particular problem I'm trying to fix involves working with image files stored on a local drive, and manipulating the file names which may contain foreign characters. The two experiments shown below are intermediate debugging steps.
The ã character is common in Latin languages, e.g. http://pt.wikipedia.org/wiki/Cão
Experiment 1
Look closely and note how cão becomes cao.
Experiment 2
Here I tried using File::Find instead of piped input, in case the issue was with the Windows implementation of the | shell operator. The issue actually gets worse, as the ~a becomes Pi:
Debugging update:
I tried some of the tricks listed at http://perldoc.perl.org/perlunicode.html, e.g. use utf8, use feature 'unicode_strings', etc., to no avail.
Environment and Version Info
The OS is Windows 7, 64-bit.
The Perl is:
This is perl 5, version 12, subversion 2 (v5.12.2) built for MSWin32-x64-multi-thread
(with 8 registered patches, see perl -V for more detail)
Copyright 1987-2010, Larry Wall
Binary build 1202 [293621] provided by ActiveState http://www.ActiveState.com
Built Sep 6 2010 22:53:42
Perl, as with many other scripting languages, is built on the C runtime.
On Windows, the standard MS C runtime for narrow (byte) characters uses an encoding which defaults to the Windows system encoding (‘ANSI code page’) for IO activities such as opening files or writing to the console.
The ANSI code page is always a locale-specific encoding: usually single-byte, but multi-byte in some locales (e.g. China and Japan). It is never UTF-8 or anything else capable of reproducing the whole of Unicode, so which characters Perl IO can cope with depends on the Windows locale (the “language for non-Unicode programs” setting).
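To make the code-page point concrete, here is a minimal sketch that prints the bytes for the same name under three encodings. It is an illustration only: the helper name hex_bytes is hypothetical, and cp1252 and cp850 are assumed as typical Western European ANSI and console (OEM) code pages, which may not match the asker's machine.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a character string with the named
# encoding and return its bytes as space-separated hex.
sub hex_bytes {
    my ($encoding, $string) = @_;
    my $bytes = encode($encoding, $string);
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# "cão", written with an escape so the example does not depend on
# how this source file itself is saved.
my $name = "c\x{E3}o";

for my $encoding ('cp1252', 'cp850', 'UTF-8') {
    printf "%-6s %s\n", $encoding, hex_bytes($encoding, $name);
}
# cp1252 63 E3 6F
# cp850  63 C6 6F
# UTF-8  63 C3 A3 6F
```

The same three characters come out as three different byte sequences, so a program that writes bytes in one encoding and a console or file API that reads them in another will show something other than ã.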
Whilst console apps can be given UTF-8 using the chcp 65001 command, doing so exposes a number of serious inconsistencies. This causes difficulty for a lot of tools on Windows and is something Microsoft really needs to fix, but so far their attitude is that Unicode Equals UTF-16; everyone who wants Unicode to work must use the widechar interfaces.
So you won't currently be able to deal with files that use non-ASCII filenames reliably in Perl on Windows. Sorry.
You could try Python (which added special Windows-only filename handling to get around this problem in version 2.3 onwards; see PEP 277), or one of the Unicode-aware Windows Scripting Host languages. Either way, getting Unicode out to the console on Windows still has more pitfalls.
The following 3 liner works as expected on my newly minted ActivePerl 5.12.2:
use utf8;
open(my $file, '>:encoding(UTF-8)', "output.txt") or die $!;
print $file "さっちゃん";
I think the culprit is cmd.exe.
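One way to check that suspicion without involving the console at all is to read the file back as raw bytes and compare them with the expected UTF-8 encoding. The following is only a sketch: the helper name round_trip_ok is hypothetical, and it assumes the script itself is saved as UTF-8 (as use utf8 requires).

```perl
use strict;
use warnings;
use utf8;                     # string literals in this file are UTF-8
use Encode qw(encode);

# Hypothetical helper: write $text through a UTF-8 layer, read the
# file back as raw bytes, and report whether the bytes match.
sub round_trip_ok {
    my ($path, $text) = @_;

    open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
    print {$out} $text;
    close($out) or die "Cannot close $path: $!";

    open(my $in, '<:raw', $path) or die "Cannot read $path: $!";
    my $bytes = do { local $/; <$in> };   # slurp the whole file
    close($in);

    return $bytes eq encode('UTF-8', $text);
}

print round_trip_ok('output.txt', "さっちゃん")
    ? "bytes on disk are the expected UTF-8\n"
    : "bytes on disk differ from the expected UTF-8\n";
```

If the bytes match, the file contents are fine and any garbling seen on screen comes from how the console displays them. Note that this only exercises file contents; it says nothing about non-ASCII file names.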