I'm trying to convert a UTF-8 string to an ISO-8859-1 char* for use in legacy code. The only way I'm seeing to do this is with iconv.
I would definitely prefer a completely string-based C++ solution, then just call .c_str() on the resulting string.
How do I do this? Code example if possible, please. I'm fine using iconv if it is the only solution you know.
byte[] utf8 = ...
byte[] latin1 = new String(utf8, "UTF-8").getBytes("ISO-8859-1");
You can exercise more control by using the lower-level Charset APIs. For example, you can raise an exception when an un-encodable character is found, or use a different character for replacement text.
UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.
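To make that relationship concrete, here is a minimal sketch of the reverse direction (Latin-1 to UTF-8). The function name latin1ToUtf8 is made up for illustration. It shows that ASCII bytes pass through unchanged, while each byte in the range 0x80 to 0xFF becomes a two-byte UTF-8 sequence.

```cpp
#include <string>

// Hypothetical helper: re-encodes ISO 8859-1 bytes as UTF-8.
// Every Latin-1 byte value is also its Unicode code point.
std::string latin1ToUtf8(const std::string& latin1)
{
    std::string out;
    for (unsigned char ch : latin1)
    {
        if (ch < 0x80)
        {
            // ASCII: identical in both encodings.
            out.push_back(static_cast<char>(ch));
        }
        else
        {
            // U+0080..U+00FF: lead byte 110xxxxx, then continuation byte 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (ch >> 6)));
            out.push_back(static_cast<char>(0x80 | (ch & 0x3F)));
        }
    }
    return out;
}
```

For example, the Latin-1 byte 0xE9 (é) comes out as the two UTF-8 bytes 0xC3 0xA9, which is exactly the pair a UTF-8 to Latin-1 converter has to collapse back into one byte.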
ISO-8859-1 was (according to the standard, at least) the default encoding of documents delivered via HTTP with a MIME type beginning with "text/" (HTML5 changed this to Windows-1252). As of September 2022, 1.3% of all (but only 5 of the top 1000) Web sites use ISO/IEC 8859-1.
UTF-8 is the universal code page for internationalization and is able to encode the entire Unicode character set. It is used pervasively on the web, and is the default for *nix-based platforms.
I'm going to modify my code from another answer to implement the suggestion from Alf.
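#include <string>

std::string UTF8toISO8859_1(const char * in)
{
    std::string out;
    if (in == nullptr)
        return out;

    // Initialized so that a stray leading continuation byte does not read
    // an indeterminate value.
    unsigned int codepoint = 0;
    while (*in != 0)
    {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;                              // ASCII
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);  // continuation byte
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;                       // lead byte of a 2-byte sequence
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;                       // lead byte of a 3-byte sequence
        else
            codepoint = ch & 0x07;                       // lead byte of a 4-byte sequence
        ++in;
        // The code point is complete once the next byte is not a continuation byte.
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
        {
            if (codepoint <= 255)
            {
                out.append(1, static_cast<char>(codepoint));
            }
            else
            {
                // do whatever you want for out-of-bounds characters
            }
        }
    }
    return out;
}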
Invalid UTF-8 input results in dropped characters.
First convert UTF-8 to 32-bit Unicode.
Then keep the values that are in the range 0 through 255.
Those are the Latin-1 code points, and for other values, decide if you want to treat that as an error or perhaps replace with code point 127 (my fav, the ASCII "del") or question mark or something.
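The two steps above can be sketched roughly as follows. The names decodeUtf8 and toLatin1 are invented for this example, and the decoder is deliberately simplified: it assumes that replacing malformed sequences with U+FFFD is acceptable, and it does not reject overlong forms or surrogates.

```cpp
#include <cstddef>
#include <string>

// Step 1 (hypothetical helper): decode UTF-8 into 32-bit code points.
// Malformed or truncated sequences become U+FFFD (an assumption of this sketch).
std::u32string decodeUtf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        unsigned char lead = static_cast<unsigned char>(in[i++]);
        char32_t cp = 0;
        int extra = 0;
        if (lead <= 0x7F)               { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); continue; }  // stray continuation or invalid lead byte

        bool ok = true;
        for (; extra > 0; --extra)
        {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80)
            {
                ok = false;  // sequence ended early
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        }
        out.push_back(ok ? cp : static_cast<char32_t>(0xFFFD));
    }
    return out;
}

// Step 2 (hypothetical helper): keep code points 0..255, substitute the rest.
std::string toLatin1(const std::u32string& codepoints, char replacement = '?')
{
    std::string out;
    for (char32_t cp : codepoints)
        out.push_back(cp <= 0xFF ? static_cast<char>(cp) : replacement);
    return out;
}
```

The legacy code would then receive toLatin1(decodeUtf8(input)).c_str(). Passing '\x7F' as the second argument gives the "del" replacement instead of a question mark, and a version that treats out-of-range values as an error could throw at that point instead.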
The C++ standard library defines a std::codecvt specialization that can be used,
template<>
codecvt<char32_t, char, mbstate_t>
C++11 §22.4.1.4/3: “the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes”
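One way this facet might be used is sketched below. The names deletable_facet and utf8ToLatin1ViaCodecvt are invented for the example, and throwing on invalid input is just one possible policy. Note that this specialization is deprecated as of C++20, so newer compilers may emit a warning.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// std::codecvt has a protected destructor; this wrapper makes it usable
// as a local object.
template <class Facet>
struct deletable_facet : Facet
{
    using Facet::Facet;
    ~deletable_facet() {}
};

// Hypothetical helper: UTF-8 -> UTF-32 via std::codecvt, then narrow to Latin-1.
std::string utf8ToLatin1ViaCodecvt(const std::string& utf8, char replacement = '?')
{
    if (utf8.empty())
        return std::string();

    deletable_facet<std::codecvt<char32_t, char, std::mbstate_t>> cvt;
    std::mbstate_t state = std::mbstate_t();

    // One UTF-8 byte yields at most one code point, so this buffer is large enough.
    std::u32string wide(utf8.size(), U'\0');
    const char* fromNext = nullptr;
    char32_t* toNext = nullptr;

    std::codecvt_base::result res = cvt.in(
        state,
        utf8.data(), utf8.data() + utf8.size(), fromNext,
        &wide[0], &wide[0] + wide.size(), toNext);

    if (res != std::codecvt_base::ok)
        throw std::runtime_error("invalid or truncated UTF-8 input");

    wide.resize(static_cast<std::size_t>(toNext - &wide[0]));

    // Keep the Latin-1 range, substitute everything else.
    std::string out;
    for (char32_t cp : wide)
        out.push_back(cp <= 0xFF ? static_cast<char>(cp) : replacement);
    return out;
}
```

Calling utf8ToLatin1ViaCodecvt(input).c_str() then yields the char* for the legacy code, valid for as long as the returned string is kept alive.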