
Convert UTF8 to UTF16 using iconv

Converting from UTF-16 to UTF-8 with iconv works fine, but the reverse does not. I have these files:

a-16.strings:    Little-endian UTF-16 Unicode c program text
a-8.strings:     UTF-8 Unicode c program text, with very long lines

The text looks OK in an editor. When I run this:

iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16.strings 

Then I get this result:

b-16.strings:    data
a-16.strings:    Little-endian UTF-16 Unicode c program text
a-8.strings:     UTF-8 Unicode c program text, with very long lines

The file utility does not report the expected file format, and the text does not look right in an editor either. Could it be that iconv does not write a proper BOM? I am running it on the Mac command line.

Why is b-16.strings not in proper UTF-16LE format? Is there another way to convert UTF-8 to UTF-16?

More detail is below.

$ iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16le-BAD-fromUTF8.strings
$ iconv -f UTF-8 -t UTF-16 a-8.strings > b-16be.strings
$ iconv -f UTF-16 -t UTF-16LE b-16be.strings > b-16le-BAD-fromUTF16BE.strings
$ file *s
a-16.strings:                   Little-endian UTF-16 Unicode c program text, with very long lines
a-8.strings:                    UTF-8 Unicode c program text, with very long lines
b-16be.strings:                 Big-endian UTF-16 Unicode c program text, with very long lines
b-16le-BAD-fromUTF16BE.strings: data
b-16le-BAD-fromUTF8.strings:    data
$ od -c a-16.strings | head
0000000  377 376   /  \0   *  \0      \0  \f 001   E  \0   S  \0   K  \0
$ od -c a-8.strings | head
0000000    /   *   *   *       Č  **   E   S   K   Y       (   J   V   O
$ od -c b-16be.strings | head
0000000  376 377  \0   /  \0   *  \0   *  \0   *  \0     001  \f  \0   E
$ od -c b-16le-BAD-fromUTF16BE.strings | head
0000000    /  \0   *  \0   *  \0   *  \0      \0  \f 001   E  \0   S  \0
$ od -c b-16le-BAD-fromUTF8.strings | head
0000000    /  \0   *  \0   *  \0   *  \0      \0  \f 001   E  \0   S  \0

It is clear that the BOM is missing whenever I convert to UTF-16LE. Any help on this?

PerfectGamesOnline.com asked Jan 19 '12 09:01


1 Answer

UTF-16LE tells iconv to generate little-endian UTF-16 without a BOM (Byte Order Mark). Apparently it assumes that since you specified LE, the BOM isn't necessary.

UTF-16 tells it to generate UTF-16 text (in the local machine's byte order) with a BOM.
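
The difference is easy to see on a one-character input. This is only a sketch: hex_bytes is a helper name I made up for illustration, and for plain UTF-16 I deliberately show no expected output, because whether a BOM is written and which byte order is used are up to your iconv.

```shell
# Hypothetical helper: print stdin as space-separated hex bytes.
hex_bytes() {
    od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# Explicit byte order: no BOM, just the single code unit for "A".
le=$(printf 'A' | iconv -f UTF-8 -t UTF-16LE | hex_bytes)
echo "UTF-16LE: $le"    # 41 00

# Generic UTF-16: BOM and byte order are decided by the iconv in use.
generic=$(printf 'A' | iconv -f UTF-8 -t UTF-16 | hex_bytes)
echo "UTF-16:   $generic"
```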

If you're on a little-endian machine, I don't see a way to tell iconv to generate big-endian UTF-16 with a BOM, but I might just be missing something.

I find that the file command doesn't recognize UTF-16 text without a BOM, and your editor might not either. But if you run iconv -f UTF-16LE -t UTF-8 b-16.strings, you should get a valid UTF-8 version of the original file.
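
That round trip can be turned into a quick check. A minimal sketch, assuming the file in the first argument is BOM-less little-endian UTF-16; the function name roundtrip_ok is hypothetical:

```shell
# Hypothetical helper: succeed if the BOM-less UTF-16LE file $1
# decodes to exactly the bytes of the UTF-8 file $2.
roundtrip_ok() {
    iconv -f UTF-16LE -t UTF-8 "$1" | cmp -s - "$2"
}
```

For the files in the question that would be something like: roundtrip_ok b-16.strings a-8.strings && echo "contents match"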

Try running od -c on the files to see their actual contents.
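
For the BOM question specifically, only the first two bytes matter. A small sketch that checks just those; the function name utf16_bom is made up, and it relies on head -c, which is widely available but not strictly POSIX:

```shell
# Hypothetical helper: report which UTF-16 BOM, if any, starts a file.
utf16_bom() {
    # Read the first two bytes and render them as hex, e.g. "fffe".
    first=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    case "$first" in
        fffe) echo "UTF-16LE BOM" ;;
        feff) echo "UTF-16BE BOM" ;;
        *)    echo "no UTF-16 BOM" ;;
    esac
}
```

Going by the od output in the question, a-16.strings starts with 377 376 (FF FE), so utf16_bom a-16.strings would report the little-endian BOM.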

UPDATE:

It looks like you're on a big-endian machine (x86 is little-endian), and you're trying to generate a little-endian UTF-16 file with a BOM. Is that correct? As far as I can tell, iconv won't do that directly. But this should work:

( printf "\xff\xfe" ; iconv -f utf-8 -t utf-16le UTF-8-FILE ) > UTF-16-FILE 

The behavior of the printf might depend on your locale settings; I have LANG=en_US.UTF-8.
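
If the \x escapes are a worry, octal escapes are the form POSIX defines for printf format strings, so they don't depend on the shell understanding \x. Here is a sketch of the same workaround wrapped in a function; the name utf8_to_utf16le_bom is my own invention:

```shell
# Hypothetical wrapper: UTF-8 file in, little-endian UTF-16 with a BOM on stdout.
utf8_to_utf16le_bom() {
    # \377\376 is FF FE, the little-endian byte order mark, written in octal.
    printf '\377\376' &&
    iconv -f UTF-8 -t UTF-16LE "$1"
}
```

Usage would be: utf8_to_utf16le_bom UTF-8-FILE > UTF-16-FILE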

(Can anyone suggest a more elegant solution?)

Another workaround, if you know the endianness of the output produced by -t utf-16:

iconv -f utf-8 -t utf-16 UTF-8-FILE | dd conv=swab 2>/dev/null 
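
conv=swab swaps every pair of adjacent bytes, so it flips the BOM along with each 16-bit code unit. A quick illustration on hand-written bytes, independent of what iconv produces:

```shell
# Big-endian UTF-16 for "A" with a BOM is FE FF 00 41.
# Swapping each byte pair gives FF FE 41 00: little-endian, BOM included.
swapped=$(printf '\376\377\000A' | dd conv=swab 2>/dev/null | od -An -tx1 | tr -d ' \n')
echo "$swapped"    # fffe4100
```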
Keith Thompson answered Sep 22 '22 16:09