
Why does Perl lose foreign characters on Windows; can this be fixed (if so, how)?

Note below how ã changes to a. Before you blame this on CMD.EXE and Windows pipe weirdness, see Experiment 2 below, which hits a similar problem using File::Find and no pipe at all.

The particular problem I'm trying to fix involves working with image files stored on a local drive, and manipulating the file names which may contain foreign characters. The two experiments shown below are intermediate debugging steps.

The ã character is common in Latin-derived languages such as Portuguese, e.g. http://pt.wikipedia.org/wiki/Cão

Experiment 1

Look closely and note how cão becomes cao. [screenshot]

Experiment 2

Here I tried using File::Find instead of piped input, in case the issue was with the Windows implementation of the | shell operator. The issue actually gets worse, as the ã becomes Pi: [screenshot]


Debugging update:

I tried some of the tricks listed at http://perldoc.perl.org/perlunicode.html, e.g. use utf8, use feature 'unicode_strings', etc, to no avail.


Environment and Version Info

The OS is Windows 7, 64-bit.

The Perl is:

This is perl 5, version 12, subversion 2 (v5.12.2) built for MSWin32-x64-multi-thread
(with 8 registered patches, see perl -V for more detail)

Copyright 1987-2010, Larry Wall

Binary build 1202 [293621] provided by ActiveState http://www.ActiveState.com
Built Sep  6 2010 22:53:42
Alex R asked Dec 24 '10 16:12


2 Answers

Perl, as with many other scripting languages, is built on the C runtime.

On Windows, the standard MS C runtime for narrow (byte) characters uses an encoding which defaults to the Windows system encoding (‘ANSI code page’) for IO activities such as opening files or writing to the console.

The ANSI code page is always a locale-specific encoding: usually single-byte, but multi-byte in some locales (e.g. China, Japan). On Windows 7, as used here, it is never UTF-8 or anything else capable of reproducing the whole of Unicode, so which characters Perl IO can cope with depends on the Windows locale (the “language for non-Unicode programs” setting).
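To make the locale dependence concrete, here is a minimal sketch using the core Encode module. It only checks whether a name can be expressed in a given code page; it does not touch the file system. The helper name representable_in is hypothetical, and cp1252 and cp932 are assumed stand-ins for a Western European and a Japanese ANSI code page. The file must be saved as UTF-8 for use utf8 to read the literals correctly.

```perl
use strict;
use warnings;
use utf8;                          # string literals below are UTF-8 in the source
use Encode qw(encode FB_CROAK);

# Hypothetical helper: true if every character of $text exists in $codepage.
# encode() with FB_CROAK dies on the first unmappable character.
sub representable_in {
    my ($text, $codepage) = @_;
    return eval { encode($codepage, $text, FB_CROAK); 1 } ? 1 : 0;
}

binmode(STDOUT, ':encoding(UTF-8)');
for my $name ("cão", "さっちゃん") {
    for my $cp (qw(cp1252 cp932)) {   # assumed example code pages
        printf "%s in %s: %s\n", $name, $cp,
            representable_in($name, $cp) ? "yes" : "no";
    }
}
```

A name that fails this check for the machine's ANSI code page is one that byte-oriented file calls are unlikely to handle faithfully.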

Whilst console apps can be given UTF-8 using the chcp 65001 command, doing so brings up a number of serious inconsistencies. This causes difficulty for a lot of tools on Windows and is something Microsoft really needs to fix, but so far their attitude is that Unicode Equals UTF-16: everyone who wants Unicode to work must use the widechar interfaces.

So you won't currently be able to deal with files that use non-ASCII filenames reliably in Perl on Windows. Sorry.

You could try Python (which added special Windows-only filename handling to get around this problem in version 2.3 onwards; see PEP 277), or one of the Unicode-aware Windows Scripting Host languages. Either way, getting Unicode out to the console on Windows still has more pitfalls.

bobince answered Sep 28 '22 12:09


The following three-liner works as expected on my newly minted ActivePerl 5.12.2:

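use strict;
use warnings;
use utf8;  # the source file itself must be saved as UTF-8
open(my $file, '>:encoding(UTF-8)', 'output.txt') or die $!;
print $file "さっちゃん";
close($file) or die $!;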

I think the culprit is cmd.exe.
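One way the console could be involved, sketched under assumptions: if a file name comes back as bytes in the ANSI code page but the console displays bytes using a different (OEM) code page, the same byte shows up as a different character. The code pages below (cp1252 for ANSI, cp437 for the console) are assumptions about a typical US-English setup, not something confirmed by the question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The single byte 0xE3, as it might come back from a file name or a pipe.
my $byte = "\xE3";

# Assumption: cp1252 is the ANSI code page, cp437 the console (OEM) code page.
my $as_ansi = decode('cp1252', $byte);   # LATIN SMALL LETTER A WITH TILDE
my $as_oem  = decode('cp437',  $byte);   # GREEK SMALL LETTER PI

printf "cp1252: U+%04X\n", ord $as_ansi; # prints "cp1252: U+00E3"
printf "cp437:  U+%04X\n", ord $as_oem;  # prints "cp437:  U+03C0"
```

Under those assumptions the byte for ã would be drawn as π, which would be consistent with what Experiment 2 reports, though it does not prove that this is what happened there.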

David Heffernan answered Sep 29 '22 12:09