
UTF-16 perl input output

I am writing a script that takes a UTF-16 encoded text file as input and outputs a UTF-16 encoded text file.

use open "encoding(UTF-16)";

open INPUT, "< input.txt"
   or die "cannot open < input.txt: $!\n";
open OUTPUT, "> output.txt"
   or die "cannot open > output.txt: $!\n";

while(<INPUT>) {
   print OUTPUT "$_\n";
}

Let's just say that my program writes everything from input.txt into output.txt.

This works perfectly fine in my Cygwin environment, which is using "This is perl 5, version 14, subversion 2 (v5.14.2) built for cygwin-thread-multi-64int".

But in my Windows environment, which is using "This is perl 5, version 12, subversion 3 (v5.12.3) built for MSWin32-x64-multi-thread", every line in output.txt except the first is prepended with garbage characters.

For example:

<FIRST LINE OF TEXT>
਀    ㈀  ㄀Ⰰ ㈀Ⰰ 嘀愀 ㌀ 䌀栀椀愀 䐀⸀⸀⸀  儀甀愀渀最 䠀ഊ<SECOND LINE OF TEXT>
...

Can anyone give some insight into why it works on Cygwin but not on Windows?

EDIT: After printing the I/O layers as suggested:

In Windows environment:

unix
crlf
encoding(UTF-16)
utf8
unix
crlf
encoding(UTF-16)
utf8

In Cygwin environment:

unix
perlio
encoding(UTF-16)
utf8
unix
perlio
encoding(UTF-16)
utf8

The only difference is the second layer: perlio on Cygwin versus crlf on Windows.
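For reference, a layer listing like the ones above can be produced with the built-in PerlIO::get_layers function. The following is only a minimal sketch: the helper name layers_of and the temporary file are hypothetical and stand in for the real handles, and the exact list printed depends on the platform and Perl build.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return the names of the PerlIO layers on a handle.
sub layers_of {
    my ($fh) = @_;
    return PerlIO::get_layers($fh);
}

# Scratch file standing in for output.txt (an assumption of this sketch).
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
close $tmp_fh;

open(my $out, '>:encoding(UTF-16)', $tmp_name)
    or die "cannot open $tmp_name: $!\n";

# One layer name per line, bottom of the stack first.
my @layers = layers_of($out);
print "$_\n" for @layers;

close $out;
```

Layers are listed from the bottom of the stack upward, so a crlf entry that appears before encoding(UTF-16) sits below the encoding layer.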

allenylzhou asked Jan 16 '23 05:01


1 Answer

[ I was going to wait and give a thorough answer, but it's probably better if I give you a quick answer than nothing. ]

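The problem is that the crlf and encoding layers are in the wrong order: crlf sits below the encoding layer, so it rewrites line-feed bytes after the text has already been encoded. Not your fault.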

For example, say you do print "a\nb\nc\n"; using UTF-16le (since it's simpler and it's probably what you actually want). You'd end up with

61 00 0D 0A 00 62 00 0D 0A 00 63 00 0D 0A 00

instead of

61 00 0D 00 0A 00 62 00 0D 00 0A 00 63 00 0D 00 0A 00
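The two byte sequences above can be reproduced on any platform without touching the I/O layers at all. This is a rough simulation, not what PerlIO does internally: it assumes the misplaced crlf layer behaves like a byte-level substitution applied after encoding, and hex_dump is a hypothetical helper introduced only for display.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: format a byte string as space-separated hex pairs.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

my $text = "a\nb\nc\n";

# Wrong order: encode first, then rewrite every 0A byte as 0D 0A.
my $wrong = encode('UTF-16LE', $text);
$wrong =~ s/\x0A/\x0D\x0A/g;

# Right order: translate newlines on characters, then encode.
(my $crlf_text = $text) =~ s/\n/\r\n/g;
my $right = encode('UTF-16LE', $crlf_text);

print hex_dump($wrong), "\n";
print hex_dump($right), "\n";
```

In the first dump the inserted 0D is a single byte, which shifts every following 16-bit code unit by one byte. That would explain why everything after the first line break decodes as unrelated characters.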

I don't think you can get the right results with the open pragma or with binmode, but it can be done using open.

open(my $fh, '<:raw:encoding(UTF-16):crlf', $qfn)

With some older versions of Perl you'll also need to append :utf8 to that layer list, IIRC.

It works on Cygwin because the crlf layer is only added on Windows. On Cygwin you'd get
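As a sketch of how that layer string might be used in both directions, the following writes a small file and reads it back. It makes several assumptions: it uses UTF-16LE rather than UTF-16 (so no BOM is involved), it uses a temporary file in place of the real input and output files, and it only exercises ASCII text on a reasonably recent Perl.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file standing in for $qfn (an assumption of this sketch).
my ($tmp_fh, $qfn) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write: :crlf sits above the encoding, so "\n" becomes "\r\n"
# while the data is still characters, before it is encoded.
open(my $out, '>:raw:encoding(UTF-16LE):crlf', $qfn)
    or die "cannot open $qfn for writing: $!\n";
print {$out} "a\nb\n";
close($out) or die "cannot close $qfn: $!\n";

# Inspect the raw bytes that ended up on disk.
open(my $raw, '<:raw', $qfn) or die "cannot open $qfn: $!\n";
my $bytes = do { local $/; <$raw> };
close $raw;
my $hex = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
print "$hex\n";

# Read back through the same layer stack; "\r\n" is turned back into "\n".
open(my $in, '<:raw:encoding(UTF-16LE):crlf', $qfn)
    or die "cannot open $qfn for reading: $!\n";
my @lines = <$in>;
close $in;
```

Spelling out the layers on each open call, rather than relying on the open pragma, is what lets crlf be placed above the encoding layer.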

61 00 0A 00 62 00 0A 00 63 00 0A 00
ikegami answered Jan 22 '23 10:01