I'm trying to use iconv(3) to convert a wide-character string to UTF-8 using the code below. When I run it, the iconv call fails with E2BIG, as if the output buffer did not have enough space. This happens even though (I think) I have sized the output buffer for the worst-case UTF-8 expansion. In fact, since the input is a plain ASCII 'A' encoded as a wchar_t followed by a zero wchar_t terminator, the output should be exactly two bytes: an 'A' followed by a '\0'.
'man utf-8' on my Linux system says that the maximum length of a UTF-8 byte sequence is 6 bytes. My input buffer holds 2 wchar_ts (a character followed by the null terminator), which on my system is 8 bytes in total (sizeof(wchar_t) == 4), so I believe an output buffer of 12 bytes (2 * UTF8_SEQUENCE_MAXLEN) should be sufficient.
By experiment, if I increase UTF8_SEQUENCE_MAXLEN to 16, iconv's return value indicates success (15 still fails). But I cannot see how any wchar_t value could occupy that many bytes when encoded in UTF-8.
Have I gone wrong in my calculations? Are 16-byte UTF-8 sequences possible? What have I done wrong?
#include <stdio.h>
#include <stdlib.h>
#include <iconv.h>
#include <wchar.h>

#define UTF8_SEQUENCE_MAXLEN 6
/* #define UTF8_SEQUENCE_MAXLEN 16 */

int
main(int argc, char **argv)
{
    wchar_t *wcs = L"A";
    signed char utf8[(1 /* wcslen(wcs) */ + 1 /* L'\0' */) * UTF8_SEQUENCE_MAXLEN];
    char *iconv_in = (char *) wcs;
    char *iconv_out = (char *) &utf8[0];
    size_t iconv_in_bytes = (wcslen(wcs) + 1 /* L'\0' */) * sizeof(wchar_t);
    size_t iconv_out_bytes = sizeof(utf8);
    size_t ret;
    iconv_t cd;

    cd = iconv_open("WCHAR_T", "UTF-8");
    if ((iconv_t) -1 == cd) {
        perror("iconv_open");
        return EXIT_FAILURE;
    }

    ret = iconv(cd, &iconv_in, &iconv_in_bytes, &iconv_out, &iconv_out_bytes);
    if ((size_t) -1 == ret) {
        perror("iconv");
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}
The arguments to iconv_open are the wrong way around. The order of arguments is (to, from), not (from, to), as is clearly stated in the manpage.
Consequently, changing
iconv_open("WCHAR_T", "UTF-8");
to
iconv_open("UTF-8", "WCHAR_T");
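causes the (otherwise unchanged) code above to work as expected.
The swapped arguments also appear to explain the threshold of 16 seen in the question. With the conversion running from UTF-8 to WCHAR_T, each of the 8 input bytes (the 'A' and seven zero bytes) is a valid one-byte UTF-8 sequence, and each one becomes a 4-byte wchar_t on output. That needs 8 * 4 = 32 bytes, which is exactly 2 * 16; with 15 the buffer holds only 30 bytes, hence E2BIG.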
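For reference, the corrected call can be wrapped in a small helper. This is only a sketch: wcs_to_utf8 is a hypothetical name that does not appear in the original code, and it assumes the platform's iconv accepts the "WCHAR_T" encoding name, as the code in the question does.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper (not part of the original post).
 * Converts the NUL-terminated wide string `wcs` to UTF-8 in `out`,
 * which has room for `out_size` bytes including the terminator.
 * Returns the number of bytes written, terminator included, or
 * (size_t) -1 on failure with errno set (e.g. E2BIG).
 */
size_t
wcs_to_utf8(const wchar_t *wcs, char *out, size_t out_size)
{
    /* iconv_open takes (tocode, fromcode): destination encoding first. */
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
    if ((iconv_t) -1 == cd)
        return (size_t) -1;

    /* iconv wants a non-const char **, so the const is cast away here. */
    char *in_ptr = (char *) wcs;
    char *out_ptr = out;
    size_t in_left = (wcslen(wcs) + 1 /* L'\0' */) * sizeof(wchar_t);
    size_t out_left = out_size;

    size_t ret = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);

    /* Preserve iconv's errno across iconv_close. */
    int saved_errno = errno;
    iconv_close(cd);

    if ((size_t) -1 == ret) {
        errno = saved_errno;
        return (size_t) -1;
    }

    /* iconv decrements out_left by the number of bytes it produced. */
    return out_size - out_left;
}
```

On a system like the one described in the question, calling wcs_to_utf8(L"A", buf, sizeof buf) with the original 12-byte buffer should return 2: the 'A' and the terminating '\0'.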
D'oh. Need to read manpages more closely.