I need to produce some UTF-16LE encoded files with CRLF line separators on a Windows 7 box (currently with Strawberry Perl 5.20.1).
I had to fiddle for a long time before getting correct output, and I wonder whether my solution is the right way to do it, because it seems overcomplicated compared with other languages. In particular:
- Why does Perl produce valid big-endian UTF-16 with a correct BOM with encoding(UTF-16), while there is no BOM if I use either UTF-16LE or UTF-16BE without the additional package File::BOM?
- Why does the out-of-the-box CRLF handling seem buggy (the line ending is output as 0D 0A 00 instead of 0D 00 0A 00) without some twiddling of the filters? I doubt it could be a true bug in a language with so many users...
Here are my attempts, with comments; the last statements are the ones I found to be correct.
use strict;
use warnings;
use utf8;
use File::BOM;
use feature 'say';
my $UTF;
my $data = "Hello, héhé, 中文.\nsecond line : my 2€"; # 中文 = zhong wen = chinese
# UTF16 BE + BOM but incorrect CRLF: "0D 0A 00" instead of "0D 00 0A 00"
open $UTF, ">:encoding(UTF-16)", "utf-16-std-be.txt" or die $!;
say $UTF $data;
close $UTF;
# same as UTF-16BE (no BOM, incorrect CRLF)
open $UTF, ">:encoding(ucs2)", "utf-ucs2.txt" or die $!;
say $UTF $data;
close $UTF;
# UTF16 BE, no BOM, incorrect CRLF
open $UTF, ">:encoding(UTF-16BE)", "utf-16-be-nobom.txt" or die $!;
say $UTF $data;
close $UTF;
# UTF16 LE, no BOM, incorrect CRLF
open $UTF, ">:encoding(UTF-16LE)", "utf-16-le-nobom-wrongcrlf.txt" or die $!;
say $UTF $data;
close $UTF;
# UTF16 LE, BOM OK but still incorrect CRLF
open $UTF, ">:encoding(UTF-16LE):via(File::BOM)", "utf-16-le-bom-wrongcrlf.txt" or die $!;
say $UTF $data;
close $UTF;
# UTF16 LE non raw incorrect
# (crlf by default on windows) -> 0A => 0D 0A
open $UTF, ">:encoding(UTF-16LE):via(File::BOM)", "utf-16-le-bom-wrongcrlf2.txt" or die $!;
print $UTF $data, "\x0a"; # 0A is magically expanded to 0D 0A but wrong
close $UTF;
# UTF16 LE + BOM + LF
# raw -> 0A => 0A
# could be correct on UNIX but I need CRLF
open $UTF, ">:raw:encoding(UTF-16LE):via(File::BOM)", "utf-16-le-bom-wrongcrlf3.txt" or die $!;
say $UTF $data;
close $UTF;
# manual BOM, but CRLF OK
open $UTF, ">:raw:encoding(UTF-16LE):crlf", "utf-16-le-bommanual-crlfok.txt" or die $!;
print $UTF "\x{FEFF}";
say $UTF $data;
close $UTF;
#auto BOM, CRLF OK ?
#incorrect, says utf8 "\xA9" does not map to Unicode at c:/perl/Dwimperl-5.14/perl/lib/Encode.pm line 176.
# But I cannot see where the A9 comes from ??!
#~ open $UTF, ">:raw:encoding(UTF-16LE):via(File::BOM):crlf", "utf-16-le-autobom-crlfok1.txt" or die $!;
#~ print $UTF $data;
#~ say $UTF $data;
#~ close $UTF;
# WTF? \n becomes 0D 00 0D 0A 00
open $UTF, ">:encoding(UTF-16LE):crlf:via(File::BOM)", "utf-16-le-autobom-crlf2.txt" or die $!;
say $UTF $data;
close $UTF;
#CORRECT WAY?? : Automatic BOM, CRLF is OK
open $UTF, ">:raw:encoding(UTF-16LE):crlf:via(File::BOM)", "utf-16-le-autobom-crlfok3.txt" or die $!;
say $UTF $data;
close $UTF;
manual BOM, but CRLF OK
Yes, the following is indeed correct:
:raw:encoding(UTF-16LE):crlf + manual BOM
- :raw "clears" the existing :crlf and :encoding layers.
- :encoding converts between bytes and Code Points.
- :crlf converts between CRLF and LF.
So,
                               Read
        ===================================================>
                               Code                 Code
+------+   bytes   +------+   Points   +-------+   Points   +------+
| File |-----------| :enc |------------| :crlf |------------| Code |
+------+           +------+    CRLF    +-------+     LF     +------+
        <===================================================
                               Write
You want to perform the CRLF⇔LF conversion on the Code Points (not the bytes), as it does with this setup.
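To make that layer order concrete, here is a minimal sketch of the "manual BOM" variant. The helper names write_utf16le_crlf and slurp_hex and the use of a temporary file are my own choices for illustration, not something from the question. It writes a short string and dumps the resulting bytes so you can check the 0D 00 0A 00 line ending yourself.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text as UTF-16LE with a BOM and CRLF line endings.
sub write_utf16le_crlf {
    my ($path, $text) = @_;
    # :raw drops the default layers, :encoding turns code points into bytes,
    # and :crlf sits on top so it sees "\n" as a code point, not as a byte.
    open(my $fh, '>:raw:encoding(UTF-16LE):crlf', $path)
        or die "Cannot open $path: $!";
    print {$fh} "\x{FEFF}";   # manual BOM (U+FEFF is encoded as FF FE)
    print {$fh} $text;
    close($fh) or die "Cannot close $path: $!";
    return;
}

# Hypothetical helper: return the raw bytes of a file as a lowercase hex string.
sub slurp_hex {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    local $/;                 # slurp mode
    my $bytes = <$fh>;
    close($fh);
    return unpack('H*', $bytes);
}

my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
close($tmp_fh);

write_utf16le_crlf($tmp_path, "h\x{E9}\n");   # "hé" followed by a newline
my $hex = slurp_hex($tmp_path);
print "$hex\n";   # fffe6800e9000d000a00
```

Because the layer list starts with :raw, the result should not depend on the platform's default layers, so the same bytes are expected on Windows and on Unix.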
CORRECT WAY?? : Automatic BOM, CRLF is OK
While :raw:encoding(UTF-16LE):crlf:via(File::BOM) may work for a write handle, it doesn't look right (I would have expected :raw:via(File::BOM,UTF-16LE):crlf), and it fails miserably for a read handle (at least for me with Perl 5.16.3).
I just looked, and the code behind :via(File::BOM) does some very questionable things. I wouldn't use it.
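For the read side, one way to do without File::BOM is to use the same layer order and strip the BOM by hand. This is only a sketch under that assumption: read_utf16le_lines is a made-up name, and the sample file is built from hard-coded bytes just to have something to read.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a UTF-16LE file, drop a leading BOM if present,
# and return the lines without their line endings.
sub read_utf16le_lines {
    my ($path) = @_;
    open(my $fh, '<:raw:encoding(UTF-16LE):crlf', $path)
        or die "Cannot open $path: $!";
    my @lines = <$fh>;
    close($fh);
    # UTF-16LE does not remove the BOM, so it shows up as U+FEFF on the first line.
    $lines[0] =~ s/\A\x{FEFF}// if @lines;
    chomp(@lines);            # :crlf has already turned CRLF into "\n"
    return @lines;
}

# Sample file from known bytes: BOM, "a", CRLF, "b", CRLF.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh);
print {$tmp_fh} "\xFF\xFE" . "a\x00\x0D\x00\x0A\x00" . "b\x00\x0D\x00\x0A\x00";
close($tmp_fh);

my @lines = read_utf16le_lines($tmp_path);
print join('|', @lines), "\n";   # a|b
```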
why Perl is making a valid UTF-16 big-endian with correct BOM with encoding(UTF-16) while there is no BOM if I use either UTF-16LE or UTF-16BE without using an additional package File::BOM
Because you might not want a BOM.
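You can see the difference between the three encodings without touching any file, using Encode directly. A small sketch (the %hex hash is just for illustration):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode the single character "A" with each encoding and show the bytes.
my %hex;
for my $name ('UTF-16', 'UTF-16BE', 'UTF-16LE') {
    $hex{$name} = unpack('H*', encode($name, 'A'));
    printf "%-8s %s\n", $name, $hex{$name};
}
# UTF-16   feff0041   (BOM added, big-endian)
# UTF-16BE 0041       (no BOM)
# UTF-16LE 4100       (no BOM)
```

With the explicit BE/LE variants the byte order is already known from the name, so the BOM is left for you to add if you want one.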
why out-of-the-box the CRLF handling seems buggy
Adding layers adds them at the end of the list. If you want to add a layer elsewhere (as is the case here), you need to rebuild the list.
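On Windows, :crlf is already in the default list, so appending :encoding puts it above :crlf, and the CRLF conversion then runs on the encoded bytes. The sketch below reproduces both orders explicitly so it should behave the same on any platform. hex_via_layers is a hypothetical helper of mine.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write "A\n" through the given layers, return the bytes as hex.
sub hex_via_layers {
    my ($layers) = @_;
    my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
    close($tmp_fh);
    open(my $out, ">$layers", $tmp_path) or die "Cannot open $tmp_path: $!";
    print {$out} "A\n";
    close($out) or die "Cannot close $tmp_path: $!";
    open(my $in, '<:raw', $tmp_path) or die "Cannot open $tmp_path: $!";
    local $/;                 # slurp mode
    my $bytes = <$in>;
    close($in);
    return unpack('H*', $bytes);
}

# :crlf below :encoding: the order you get by appending :encoding to the Windows defaults.
my $crlf_on_bytes = hex_via_layers(':raw:crlf:encoding(UTF-16LE)');
# :crlf above :encoding: the rebuilt list.
my $crlf_on_chars = hex_via_layers(':raw:encoding(UTF-16LE):crlf');

print "crlf below encoding: $crlf_on_bytes\n";   # 41000d0a00
print "crlf above encoding: $crlf_on_chars\n";   # 41000d000a00
```

The first result is the 0D 0A 00 sequence from the question: the 0A byte of the already-encoded newline gets a single 0D byte put in front of it.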
It was suggested on the Perl development list that there should be a way of distinguishing between byte layers (e.g. :unix) and text layers (e.g. :crlf), and that adding a byte or encoding layer should dig down and place it at the appropriate spot. But no one has acted on this yet.
In addition to simplifying your code, it would allow a UTF-16* encoding layer to be added to STDIN/STDOUT/STDERR (or other existing handles). I believe that's not currently possible.