
create UTF-16LE with BOM and CRLF line separator on Windows

I need to produce some UTF-16LE encoded files with CRLF line separators on a Windows 7 box (currently with Strawberry Perl 5.20.1).

It took me a long time to get correct output, and I wonder whether my solution is the right way to do it, because it seems overcomplicated compared with other languages. In particular:

  • Why does Perl produce valid big-endian UTF-16 with a correct BOM for encoding(UTF-16), while there is no BOM if I use either UTF-16LE or UTF-16BE without the additional package File::BOM?
  • Why does the out-of-the-box CRLF handling seem buggy (it is output as 0D 0A 00 instead of 0D 00 0A 00) without some twiddling of the filters? I doubt it could be a true bug in a language with so many users...

Here are my attempts, with comments. The last statement is the one I found to be correct.

use strict;
use warnings;
use utf8;
use File::BOM;
use feature 'say';

my $UTF;
my $data = "Hello, héhé, 中文.\nsecond line : my 2€"; # 中文 = zhong wen = chinese

# UTF16 BE + BOM but incorrect CRLF: "0D 0A 00" instead of "0D 00 0A 00"
open $UTF, ">:encoding(UTF-16)", "utf-16-std-be.txt" or die $!;
say $UTF $data;
close $UTF;

# same as UTF-16BE (no BOM, incorrect CRLF)
open $UTF, ">:encoding(ucs2)", "utf-ucs2.txt" or die $!;
say $UTF $data;
close $UTF;

# UTF16 BE, no BOM, incorrect CRLF
open $UTF, ">:encoding(UTF-16BE)", "utf-16-be-nobom.txt" or die $!;
say $UTF $data;
close $UTF;

# UTF16 LE, no BOM, incorrect CRLF
open $UTF, ">:encoding(UTF-16LE)", "utf-16-le-nobom-wrongcrlf.txt" or die $!;
say $UTF $data;
close $UTF;

# UTF16 LE, BOM OK but still incorrect CRLF
open $UTF, ">:encoding(UTF-16LE):via(File::BOM)", "utf-16-le-bom-wrongcrlf.txt" or die $!;
say $UTF $data;
close $UTF;

# UTF16 LE non raw incorrect 
# (crlf by default on windows) -> 0A => 0D 0A
open $UTF, ">:encoding(UTF-16LE):via(File::BOM)", "utf-16-le-bom-wrongcrlf2.txt" or die $!;
print $UTF $data, "\x0a"; # 0A is magically expanded to 0D 0A but wrong
close $UTF;

# UTF16 LE + BOM + LF 
# raw -> 0A => 0A
# could be correct on UNIX but I need CRLF
open $UTF, ">:raw:encoding(UTF-16LE):via(File::BOM)", "utf-16-le-bom-wrongcrlf3.txt" or die $!;
say $UTF $data;
close $UTF;

# manual BOM, but CRLF OK
open $UTF, ">:raw:encoding(UTF-16LE):crlf", "utf-16-le-bommanual-crlfok.txt" or die $!;
print $UTF "\x{FEFF}";
say $UTF $data;
close $UTF;

#auto BOM, CRLF OK ?
#incorrect, says utf8 "\xA9" does not map to Unicode at c:/perl/Dwimperl-5.14/perl/lib/Encode.pm line 176.
# But I cannot see where the A9 comes from ??!
#~ open $UTF, ">:raw:encoding(UTF-16LE):via(File::BOM):crlf", "utf-16-le-autobom-crlfok1.txt" or die $!;
#~ print $UTF $data;
#~ say $UTF $data;
#~ close $UTF;

# WTF? \n becomes 0D 00 0D 0A 00
open $UTF, ">:encoding(UTF-16LE):crlf:via(File::BOM)", "utf-16-le-autobom-crlf2.txt" or die $!;
say $UTF $data;
close $UTF;

#CORRECT WAY?? : Automatic BOM, CRLF is OK
open $UTF, ">:raw:encoding(UTF-16LE):crlf:via(File::BOM)", "utf-16-le-autobom-crlfok3.txt" or die $!;
say $UTF $data;
close $UTF;
Seki asked Mar 16 '23 02:03


1 Answer

manual BOM, but CRLF OK

Yes, the following is indeed correct:

:raw:encoding(UTF-16LE):crlf + manual BOM
  • :raw "clears" the existing :crlf and :encoding layers.
  • :encoding converts between bytes and Code Points.
  • :crlf converts between CRLF and LF.

So,

                               Read
        ===================================================>

                               Code                 Code
+------+   bytes   +------+   Points   +-------+   Points   +------+
| File |-----------| :enc |------------| :crlf |------------| Code |
+------+           +------+    CRLF    +-------+     LF     +------+ 

        <===================================================
                               Write

You want the CRLF⇔LF conversion to be performed on the Code Points (not the bytes), which is what this layer order does.
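
As a minimal sketch of the setup above, the following writes a short string through :raw:encoding(UTF-16LE):crlf with a manual BOM and then dumps the resulting bytes as hex. The helper name utf16le_crlf_hex and the use of a temporary file are assumptions made for illustration only; they are not part of the original code.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text as UTF-16LE with a manual BOM and CRLF
# line endings, then return the file content as a lowercase hex string.
sub utf16le_crlf_hex {
    my ($text) = @_;

    # Temporary file, removed at program exit; closed so it can be reopened.
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;

    # :crlf sits above :encoding, so LF => CRLF happens on code points.
    open my $out, '>:raw:encoding(UTF-16LE):crlf', $path or die "open: $!";
    print {$out} "\x{FEFF}";    # manual BOM, encoded as FF FE
    print {$out} $text;
    close $out or die "close: $!";

    # Read the raw bytes back without any translation.
    open my $in, '<:raw', $path or die "open: $!";
    my $bytes = do { local $/; <$in> };
    close $in;

    return unpack 'H*', $bytes;
}

print utf16le_crlf_hex("A\nB"), "\n";    # fffe41000d000a004200
```

In the hex dump, the line break appears as 0d00 0a00, i.e. each of CR and LF is encoded as its own 16-bit unit.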


CORRECT WAY?? : Automatic BOM, CRLF is OK

While :raw:encoding(UTF-16LE):crlf:via(File::BOM) may work for a write handle, it doesn't look right (I would have expected :raw:via(File::BOM,UTF-16LE):crlf), and it fails miserably for a read handle (at least for me with Perl 5.16.3).

I just looked, and the code behind :via(File::BOM) does some very questionable things. I wouldn't use it.


why Perl is making a valid UTF-16 big-endian with correct BOM with encoding(UTF-16) while there is no BOM if I use either UTF-16LE or UTF-16BE without using an additional package File::BOM

Because you might not want a BOM.
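
The difference can be seen without any file handles by calling Encode::encode directly. This is only a small sketch; the helper name hex_of is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode $text with the named encoding and return
# the resulting bytes as a lowercase hex string.
sub hex_of {
    my ($encoding, $text) = @_;
    return unpack 'H*', encode($encoding, $text);
}

print hex_of('UTF-16',   'A'), "\n";            # feff0041 (BOM + big-endian)
print hex_of('UTF-16BE', 'A'), "\n";            # 0041     (no BOM)
print hex_of('UTF-16LE', 'A'), "\n";            # 4100     (no BOM)
print hex_of('UTF-16LE', "\x{FEFF}A"), "\n";    # fffe4100 (BOM added by hand)
```

With plain UTF-16 the BOM is what tells a reader the byte order; with UTF-16LE or UTF-16BE the name already states it, so you add U+FEFF yourself only if the consumer expects one.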

why out-of-the-box the CRLF handling seems buggy

Adding layers adds them at the end of the list. If you want to add a layer elsewhere (as is the case here), you need to rebuild the list.
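
To illustrate why the position of :crlf matters, here is a rough sketch that writes a single "\n" through two layer strings and compares the bytes. The helper newline_hex is hypothetical, and putting :crlf explicitly below :encoding is only an assumed stand-in for the Windows default, where :crlf is already on the handle before :encoding is added.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write one "\n" through the given output layers
# and return the bytes that end up in the file, as hex.
sub newline_hex {
    my ($layers) = @_;

    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;

    open my $out, ">$layers", $path or die "open: $!";
    print {$out} "\n";
    close $out or die "close: $!";

    open my $in, '<:raw', $path or die "open: $!";
    my $bytes = do { local $/; <$in> };
    close $in;

    return unpack 'H*', $bytes;
}

# :crlf above :encoding: the conversion sees code points.
my $good = newline_hex(':raw:encoding(UTF-16LE):crlf');

# :crlf below :encoding: the conversion sees already-encoded bytes,
# which is the situation the question describes as 0D 0A 00.
my $bad = newline_hex(':raw:crlf:encoding(UTF-16LE)');

print "crlf above encoding: $good\n";    # 0d000a00
print "crlf below encoding: $bad\n";
```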

It was suggested on the development list for Perl that there should be a way of distinguishing between byte layers (e.g. :unix) and text layers (e.g. :crlf), and that adding a byte or encoding layer should dig down and place it at the appropriate spot. But no one has acted on this yet.

In addition to simplifying your code, it would allow a UTF-16*[1] encoding layer to be added to STDIN/STDOUT/STDERR (or other existing handles). I believe that's not currently possible.


  1. Technically, any encoding for which CR != 13 or LF != 10 has this problem, so EBCDIC is also affected.
ikegami answered Apr 24 '23 00:04