How to use iconv(3) to convert wide string to UTF-8?

I'm trying to use iconv(3) to convert a wide-character string to UTF-8 using the code below. When I run it, the iconv call fails with errno set to E2BIG, as if there were not enough space available in the output buffer. This occurs despite the fact that (I think) I have sized the output buffer to admit the worst-case expansion for UTF-8. In fact, given that the input is a simple ASCII 'A' encoded as a wchar_t followed by a zero wchar_t terminator, the output should be exactly two bytes/chars: an 'A' followed by a '\0'.

'man utf-8' on my Linux system says that the maximum length of a UTF-8 byte sequence is 6 bytes. My input buffer holds 2 wchar_ts (a character followed by the null terminator), which is 8 bytes on my system since sizeof(wchar_t) == 4, so I believe an output buffer of 12 bytes (2 * UTF8_SEQUENCE_MAXLEN) should be sufficient.

By experiment, if I increase UTF8_SEQUENCE_MAXLEN to 16, iconv's return value indicates success (15 still fails). But I cannot see any way that any wchar_t value would occupy so many bytes when encoded in UTF-8.

Have I gone wrong in my calculations? Are 16-byte UTF-8 sequences possible? What have I done wrong?

#include <stdio.h>
#include <stdlib.h>
#include <iconv.h>
#include <wchar.h>

#define UTF8_SEQUENCE_MAXLEN 6
/* #define UTF8_SEQUENCE_MAXLEN 16 */

int
main(int argc, char **argv)
{
    wchar_t *wcs = L"A";
    signed char utf8[(1 /* wcslen(wcs) */ + 1 /* L'\0' */) * UTF8_SEQUENCE_MAXLEN];
    char *iconv_in = (char *) wcs;
    char *iconv_out = (char *) &utf8[0];
    size_t iconv_in_bytes = (wcslen(wcs) + 1 /* L'\0' */) * sizeof(wchar_t);
    size_t iconv_out_bytes = sizeof(utf8);
    size_t ret;
    iconv_t cd;

    cd = iconv_open("WCHAR_T", "UTF-8");
    if ((iconv_t) -1 == cd) {
        perror("iconv_open");
        return EXIT_FAILURE;
    }

    ret = iconv(cd, &iconv_in, &iconv_in_bytes, &iconv_out, &iconv_out_bytes);
    if ((size_t) -1 == ret) {
        perror("iconv");
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}
AnotherSmellyGeek asked Nov 03 '13 09:11

1 Answer

The arguments to iconv_open are the wrong way around. The order of arguments is (to, from), not (from, to), as is clearly stated in the manpage.

Consequently, changing

iconv_open("WCHAR_T", "UTF-8");

to

iconv_open("UTF-8", "WCHAR_T");

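causes the (otherwise unchanged) code above to work as expected. It also explains why 16 was the threshold: with the arguments swapped, iconv read the 8 input bytes as eight single-byte UTF-8 characters (an 'A' and seven NULs) and wrote each one out as a 4-byte wchar_t, which needs 32 bytes of output, i.e. 2 * 16.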
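For anyone who wants the corrected call wrapped up for reuse, here is a minimal sketch. The helper name wcs_to_utf8 and its return convention are my own invention for illustration, not part of iconv, and it assumes the platform's iconv accepts the "WCHAR_T" encoding name (glibc does).

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper (not part of iconv): converts the wide string wcs,
 * including its terminating L'\0', to UTF-8 in out (capacity out_size bytes).
 * Returns the number of bytes written, or (size_t) -1 with errno set.
 */
size_t
wcs_to_utf8(const wchar_t *wcs, char *out, size_t out_size)
{
    char *in_ptr = (char *) wcs;    /* iconv takes char ** even for input */
    char *out_ptr = out;
    size_t in_left = (wcslen(wcs) + 1 /* L'\0' */) * sizeof(wchar_t);
    size_t out_left = out_size;
    size_t ret;
    int saved_errno;
    iconv_t cd;

    /* Target encoding first, source encoding second. */
    cd = iconv_open("UTF-8", "WCHAR_T");
    if ((iconv_t) -1 == cd)
        return (size_t) -1;

    ret = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    saved_errno = errno;            /* iconv_close may overwrite errno */
    iconv_close(cd);

    if ((size_t) -1 == ret) {
        errno = saved_errno;        /* e.g. E2BIG if out is too small */
        return (size_t) -1;
    }
    return out_size - out_left;     /* bytes actually written */
}
```

Calling it with L"A" and the 12-byte buffer from the question should write two bytes, the 'A' and the '\0'. Unlike the original program, the sketch also calls iconv_close on the descriptor.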

D'oh. Need to read manpages more closely.

AnotherSmellyGeek answered Oct 05 '22 10:10