When I use iconv to convert from UTF-16 to UTF-8, everything is fine, but the reverse does not work. I have these files:
a-16.strings: Little-endian UTF-16 Unicode c program text
a-8.strings: UTF-8 Unicode c program text, with very long lines
The text looks OK in an editor. When I run this:
iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16.strings
Then I get this result:
b-16.strings: data
a-16.strings: Little-endian UTF-16 Unicode c program text
a-8.strings: UTF-8 Unicode c program text, with very long lines
The file utility does not show the expected file format, and the text does not look right in an editor either. Could it be that iconv does not create a proper BOM? I am running it on the Mac command line.
Why isn't b-16.strings in proper UTF-16LE format? Is there another way of converting UTF-8 to UTF-16?
More detail is below.
$ iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16le-BAD-fromUTF8.strings
$ iconv -f UTF-8 -t UTF-16 a-8.strings > b-16be.strings
$ iconv -f UTF-16 -t UTF-16LE b-16be.strings > b-16le-BAD-fromUTF16BE.strings
$ file *s
a-16.strings: Little-endian UTF-16 Unicode c program text, with very long lines
a-8.strings: UTF-8 Unicode c program text, with very long lines
b-16be.strings: Big-endian UTF-16 Unicode c program text, with very long lines
b-16le-BAD-fromUTF16BE.strings: data
b-16le-BAD-fromUTF8.strings: data
$ od -c a-16.strings | head
0000000 377 376 / \0 * \0 \0 \f 001 E \0 S \0 K \0
$ od -c a-8.strings | head
0000000 / * * * Č ** E S K Y ( J V O
$ od -c b-16be.strings | head
0000000 376 377 \0 / \0 * \0 * \0 * \0 001 \f \0 E
$ od -c b-16le-BAD-fromUTF16BE.strings | head
0000000 / \0 * \0 * \0 * \0 \0 \f 001 E \0 S \0
$ od -c b-16le-BAD-fromUTF8.strings | head
0000000 / \0 * \0 * \0 * \0 \0 \f 001 E \0 S \0
It is clear the BOM is missing whenever I run conversion to UTF-16LE. Any help on this?
UTF-16LE tells iconv to generate little-endian UTF-16 without a BOM (Byte Order Mark). Apparently it assumes that since you specified LE, the BOM isn't necessary.
UTF-16 tells it to generate UTF-16 text (in the local machine's byte order) with a BOM.
If you're on a little-endian machine, I don't see a way to tell iconv to generate big-endian UTF-16 with a BOM, but I might just be missing something.
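To see the difference for yourself, here is a minimal sketch that dumps the bytes iconv emits for a single "A" under each target name. The helper name hex_of is made up for this example. The UTF-16LE and UTF-16BE results should be the same everywhere; what plain UTF-16 produces depends on which iconv implementation you have, so that line carries no expected output.

```shell
# hex_of is a hypothetical helper: print the bytes iconv emits for "A"
# as one hex string, without od's offsets or spacing.
hex_of() {
    # $1 is the target encoding name passed to iconv -t
    printf 'A' | iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
}

hex_of UTF-16LE; echo   # 4100: little-endian, no BOM
hex_of UTF-16BE; echo   # 0041: big-endian, no BOM
hex_of UTF-16;   echo   # typically a BOM first; byte order varies by iconv implementation
```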
I find that the file command doesn't recognize UTF-16 text without a BOM, and your editor might not either. But if you run iconv -f UTF-16LE -t UTF-8 b-16.strings, you should get a valid UTF-8 version of the original file.
Try running od -c on the files to see their actual contents.
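If you only care about whether a BOM is present, looking at the first two bytes is enough. A rough sketch, with bom_of being a name invented here, and assuming the file is UTF-16 if it has a BOM at all (FF FE is also how a UTF-32LE BOM starts, so this is not a general encoding detector):

```shell
# bom_of is a hypothetical helper: print "le", "be", or "none"
# depending on the first two bytes of the file given as $1.
bom_of() {
    case "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" in
        fffe) echo le ;;     # FF FE: little-endian UTF-16 BOM
        feff) echo be ;;     # FE FF: big-endian UTF-16 BOM
        *)    echo none ;;   # anything else: no UTF-16 BOM
    esac
}
```

With the files from the question, bom_of a-16.strings should print le and bom_of b-16le-BAD-fromUTF8.strings should print none, going by the od output shown there.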
UPDATE:
It looks like you're on a big-endian machine (x86 is little-endian), and you're trying to generate a little-endian UTF-16 file with a BOM. Is that correct? As far as I can tell, iconv won't do that directly. But this should work:
( printf "\xff\xfe" ; iconv -f utf-8 -t utf-16le UTF-8-FILE ) > UTF-16-FILE
The behavior of the printf might depend on your locale settings; I have LANG=en_US.UTF-8.
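A variant of the same idea that avoids the \x escapes, which POSIX does not require printf to support (some shells' built-in printf, dash for one, prints them literally). Octal escapes are specified, so this sketch writes the BOM as \377\376. The function name utf8_to_utf16le_bom is just something I made up for the example.

```shell
# Hypothetical helper: convert UTF-8 on stdin to little-endian UTF-16
# with a BOM on stdout.
utf8_to_utf16le_bom() {
    printf '\377\376'             # BOM bytes FF FE, written as octal escapes
    iconv -f UTF-8 -t UTF-16LE    # iconv adds no BOM of its own for UTF-16LE
}
```

Usage would be something like utf8_to_utf16le_bom < a-8.strings > b-16.strings.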
(Can anyone suggest a more elegant solution?)
Another workaround, if you know the endianness of the output produced by -t utf-16:
iconv -f utf-8 -t utf-16 UTF-8-FILE | dd conv=swab 2>/dev/null
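To illustrate why the swab trick works, here is a small sketch on hand-written bytes rather than real iconv output: big-endian UTF-16 for "A" with a BOM is FE FF 00 41, and swapping every byte pair flips both the BOM and the code unit. The wrapper name swap_pairs is invented for the example. It only makes sense when the input is big-endian to begin with; if your iconv already emits little-endian for -t utf-16, swapping would make it big-endian instead.

```shell
# Hypothetical wrapper: swap every pair of bytes from stdin to stdout.
swap_pairs() {
    dd conv=swab 2>/dev/null
}

# Input bytes: FE FF 00 41 (big-endian BOM, then "A")
printf '\376\377\000A' | swap_pairs | od -An -tx1
# The dump shows ff fe 41 00: little-endian BOM, then "A"
```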