If I open a file and specify an encoding directly:
open(my $file,"<:encoding(UTF-16)","some.file") || die "error $!\n";
while(<$file>) {
print "$_\n";
}
close($file);
I can read the file contents nicely. However, if I do:
use Encode;
open(my $file,"some.file") || die "error $!\n";
while(<$file>) {
print decode("UTF-16",$_);
}
close($file);
I get the following error:
UTF-16:Unrecognised BOM d at F:/Perl/lib/Encode.pm line 174
How can I make it work with decode?
EDIT: here are the first several bytes:
FF FE 3C 00 68 00 74 00
If you simply specify "UTF-16", Perl is going to look for the byte-order mark (BOM) to figure out how to parse it. If there is no BOM, it's going to blow up. In that case, you have to tell Encode which byte-order you have by specifying either "UTF-16LE" for little-endian or "UTF-16BE" for big-endian.
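A minimal sketch of that difference, using the eight bytes quoted in the question. The variable names are mine, and the second buffer is simply the same data with the BOM left off:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The first eight bytes from the question: a little-endian BOM (FF FE), then "<ht".
my $bytes = "\xFF\xFE\x3C\x00\x68\x00\x74\x00";

# Plain "UTF-16" reads the BOM to pick the byte order, then drops it.
my $with_bom = decode("UTF-16", $bytes);

# "UTF-16LE" takes the byte order as given and treats every code unit as data,
# so the BOM survives as a leading U+FEFF character.
my $explicit = decode("UTF-16LE", $bytes);

# Without a BOM, plain "UTF-16" dies; eval traps the error message.
my $no_bom = "\x3C\x00\x68\x00";
my $ok     = eval { decode("UTF-16", $no_bom); 1 };
my $error  = $ok ? '' : $@;

printf "UTF-16:   %d characters\n", length $with_bom;
printf "UTF-16LE: %d characters\n", length $explicit;
print  "no BOM:   $error";
```

If you go the explicit "UTF-16LE" route on a file that does start with a BOM, you will probably want to strip the leading U+FEFF yourself.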
There's something else going on in your situation, though, and it's hard to tell what without seeing the data you have in the file. I get the same error with both snippets. If I don't have a BOM and I don't specify a byte order, my Perl complains either way. Which Perl are you using and which platform do you have? Does your platform have the native endianness of your file? I think the behaviour I see is correct according to the docs.
Also, you can't simply read a line in some unknown encoding (whatever Perl's default is) and then ship that off to decode. You might end up in the middle of a multi-byte sequence. You have to use Encode::FB_QUIET to save the part of the buffer that you couldn't decode and add that to the next chunk of data:
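use Encode;
open my($lefh), '<:raw', 'text-utf16.txt' or die "error $!\n";
my $string;
# Each pass appends the next raw line to whatever was left undecoded.
while( $string .= <$lefh> ) {
    # FB_QUIET decodes what it can and leaves the incomplete tail in $string.
    print decode("UTF-16LE", $string, Encode::FB_QUIET)
}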
You need to specify either UTF-16BE or UTF-16LE. See http://perldoc.perl.org/Encode/Unicode.html#Size%2c-Endianness%2c-and-BOM
What you're trying to do is impossible.
You're reading lines of text without specifying an encoding, so every byte that contains a newline character (default \x0a) ends a line. But this newline byte may very well be in the middle of a UTF-16 character, in which case your next line can't be decoded.
If your data is UTF-16LE, this will happen all the time: line feeds are \x0a \x00. If you have UTF-16BE, you might get lucky (newlines are \x00 \x0a), until you get a character with \x0a in the high byte.
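To make the misalignment concrete, here is a small sketch that encodes two short lines as UTF-16LE and cuts the bytes after every \x0a, roughly what a byte-oriented readline does. The sample text and variable names are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Two short lines encoded as UTF-16LE (no BOM), as they would sit on disk.
my $bytes = encode("UTF-16LE", "ab\ncd\n");

# Mimic a byte-oriented readline: cut after every \x0a byte.
my @chunks  = $bytes =~ /\G(.*?\x0a|.+)/gs;
my @lengths = map { length } @chunks;

# Show each chunk as hex so the stray \x00 at the start of later chunks is visible.
for my $chunk (@chunks) {
    print join(" ", map { sprintf "%02X", ord } split //, $chunk), "\n";
}

# The second chunk starts one byte late, so every byte pair is shifted
# and decodes to the wrong characters.
my $second = decode("UTF-16LE", $chunks[1]);
print join(" ", map { sprintf "U+%04X", ord } split //, $second), "\n";
```

The first chunk ends in the middle of the line feed, and every later chunk begins with the leftover \x00, so none of them lines up with the two-byte code units.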
So don't do that; open the file with the right encoding.
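If you still want to call decode yourself, one option (a sketch, not the only way) is to skip line-by-line reading altogether: slurp the raw bytes and decode the whole buffer once, then split into lines. The temporary file below is only there to make the example self-contained; in practice you would open your own file:

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempfile);

# Build a small UTF-16 file with a BOM so the sketch is self-contained.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} encode("UTF-16", "<html>\n<body>\n");
close $tmp;

# Read every byte untouched, then decode the whole buffer in one call.
open my $fh, '<:raw', $path or die "error $!\n";
my $octets = do { local $/; <$fh> };
close $fh;

# The BOM is at the start of the buffer, so plain "UTF-16" works here.
my $text  = decode("UTF-16", $octets);
my @lines = split /\n/, $text;

printf "%d lines decoded\n", scalar @lines;
```

This holds the whole file in memory, so for very large files the :encoding layer on open is the better fit.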