Convert UTF8 to UTF16 using iconv

Question

When I use iconv to convert from UTF-16 to UTF-8, everything works, but the reverse conversion does not. I have these files:

a-16.strings:    Little-endian UTF-16 Unicode c program text
a-8.strings:     UTF-8 Unicode c program text, with very long lines

The text looks OK in an editor. When I run this:

iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16.strings

Then I get this result:

b-16.strings:    data
a-16.strings:    Little-endian UTF-16 Unicode c program text
a-8.strings:     UTF-8 Unicode c program text, with very long lines

The file utility does not report the expected file format, and the text does not look right in the editor either. Could it be that iconv does not write a proper BOM? I am running it on the Mac command line.

Why is b-16 not in proper UTF-16LE format? Is there another way of converting UTF-8 to UTF-16?

More detail is below.

$ iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16le-BAD-fromUTF8.strings
$ iconv -f UTF-8 -t UTF-16 a-8.strings > b-16be.strings
$ iconv -f UTF-16 -t UTF-16LE b-16be.strings > b-16le-BAD-fromUTF16BE.strings

$ file *s
a-16.strings:                   Little-endian UTF-16 Unicode c program text, with very long lines
a-8.strings:                    UTF-8 Unicode c program text, with very long lines
b-16be.strings:                 Big-endian UTF-16 Unicode c program text, with very long lines
b-16le-BAD-fromUTF16BE.strings: data
b-16le-BAD-fromUTF8.strings:    data

$ od -c a-16.strings | head
0000000  377 376   /  \0   *  \0      \0  \f 001   E  \0   S  \0   K  \0

$ od -c a-8.strings | head
0000000    /   *   *   *       Č  **   E   S   K   Y       (   J   V   O

$ od -c b-16be.strings | head
0000000  376 377  \0   /  \0   *  \0   *  \0   *  \0     001  \f  \0   E

$ od -c b-16le-BAD-fromUTF16BE.strings | head
0000000    /  \0   *  \0   *  \0   *  \0      \0  \f 001   E  \0   S  \0

$ od -c b-16le-BAD-fromUTF8.strings | head
0000000    /  \0   *  \0   *  \0   *  \0      \0  \f 001   E  \0   S  \0

It is clear the BOM is missing whenever I run conversion to UTF-16LE. Any help on this?

Keith Thompson · Accepted Answer

UTF-16LE tells iconv to generate little-endian UTF-16 without a BOM (Byte Order Mark). Apparently it assumes that since you specified LE, the BOM isn't necessary.

UTF-16 tells it to generate UTF-16 text (in the local machine's byte order) with a BOM.
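To see the difference, you can convert a single character and dump the bytes. This is a minimal sketch, assuming an iconv that accepts the encoding names UTF-16LE and UTF-16; the variable names are only for illustration. The byte order chosen for plain UTF-16 varies between iconv implementations, so no expected output is shown for it.

```shell
# Convert the single character "A" and show the resulting bytes in hex.

# UTF-16LE: two bytes for "A", no BOM in front.
le_bytes=$(printf 'A' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1 | tr -d ' \n')
echo "UTF-16LE: $le_bytes"

# UTF-16: typically a 2-byte BOM (ff fe or fe ff) followed by "A"
# in the matching byte order.
bom_bytes=$(printf 'A' | iconv -f UTF-8 -t UTF-16 | od -An -tx1 | tr -d ' \n')
echo "UTF-16:   $bom_bytes"
```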

If you're on a little-endian machine, I don't see a way to tell iconv to generate big-endian UTF-16 with a BOM, but I might just be missing something.

I find that the file command doesn't recognize UTF-16 text without a BOM, and your editor might not either. But if you run iconv -f UTF-16LE -t UTF-8 b-16.strings, you should get a valid UTF-8 version of the original file.
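A round trip like that can be checked mechanically. The sketch below uses a throwaway directory and made-up file names (in.txt, le.txt, back.txt), not the files from the question, and assumes mktemp and cmp are available.

```shell
# Round-trip check: UTF-8 -> UTF-16LE (no BOM) -> UTF-8 should be lossless.
tmpdir=$(mktemp -d)

# Sample input containing a non-ASCII character: "Č" is c4 8c in UTF-8.
printf 'hello \304\214\n' > "$tmpdir/in.txt"

iconv -f UTF-8 -t UTF-16LE "$tmpdir/in.txt" > "$tmpdir/le.txt"
iconv -f UTF-16LE -t UTF-8 "$tmpdir/le.txt" > "$tmpdir/back.txt"

# cmp -s is silent and only sets the exit status.
if cmp -s "$tmpdir/in.txt" "$tmpdir/back.txt"; then
    roundtrip=ok
else
    roundtrip=differs
fi
echo "round trip: $roundtrip"

rm -rf "$tmpdir"
```

If the round trip is clean, the BOM-less file is valid UTF-16LE even though file reports it as data.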

Try running od -c on the files to see their actual contents.

UPDATE:

It looks like you're on a big-endian machine (x86 is little-endian), and you're trying to generate a little-endian UTF-16 file with a BOM. Is that correct? As far as I can tell, iconv won't do that directly. But this should work:

( printf "\xff\xfe" ; iconv -f utf-8 -t utf-16le UTF-8-FILE ) > UTF-16-FILE

The behavior of the printf might depend on your locale settings; I have LANG=en_US.UTF-8.
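One way to reduce that dependency is to write the BOM with octal escapes, which POSIX specifies for the printf format string (the \x hex form is an extension that some shells lack). A sketch of the same idea wrapped in a function; the name utf8_to_utf16le_bom is hypothetical.

```shell
# Hypothetical helper: emit the UTF-16LE BOM, then the converted text.
# The octal escapes \377\376 are the bytes 0xFF 0xFE.
utf8_to_utf16le_bom() {
    printf '\377\376'
    iconv -f UTF-8 -t UTF-16LE "$1"
}

# Usage (file names as in the question):
#   utf8_to_utf16le_bom a-8.strings > b-16.strings
```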

(Can anyone suggest a more elegant solution?)

Another workaround, if you know the endianness of the output produced by -t utf-16:

iconv -f utf-8 -t utf-16 UTF-8-FILE | dd conv=swab 2>/dev/null
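Since conv=swab swaps every pair of bytes, the BOM is swapped along with the text, so the output still carries a BOM that matches its byte order. A small check on hand-written bytes, with no iconv involved:

```shell
# Input: big-endian UTF-16 for "A" with a BOM, i.e. the bytes fe ff 00 41.
# dd conv=swab swaps each byte pair, giving ff fe 41 00 (little-endian with BOM).
swapped=$(printf '\376\377\000A' | dd conv=swab 2>/dev/null | od -An -tx1 | tr -d ' \n')
echo "$swapped"   # prints fffe4100
```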

Tags: linux, command-line, macos, unicode

PerfectGamesOnline.com
