You would think this would be readily available, but I'm having a hard time finding a simple library function that converts a C or C++ string from ISO-8859-1 encoding to UTF-8. I'm reading data in 8-bit ISO-8859-1 encoding and need to convert it to a UTF-8 string for use in an SQLite database and, eventually, an Android app.
I found one commercial product, but it's beyond my budget at this time.
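UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode code points (U+0000 through U+00FF). Both encode ASCII exactly the same way, so only bytes 0x80 through 0xFF need converting, and each of those becomes exactly two bytes in UTF-8.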
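To make the two-byte case concrete, here is a small sketch of the bit arithmetic for a single code point. The function name `encode_two_byte` is made up for illustration, and it assumes the caller only passes values in the range 0x80 to 0xFF:

```cpp
#include <array>
#include <cstdint>

// Encode one code point in the range U+0080..U+00FF as two UTF-8 bytes.
// The two-byte UTF-8 layout is: 110xxxxx 10xxxxxx
//   first byte  = 0xC0 | (top two bits of the code point)
//   second byte = 0x80 | (low six bits of the code point)
// Hypothetical helper; it does not check that cp is >= 0x80.
std::array<std::uint8_t, 2> encode_two_byte(std::uint8_t cp) {
    return { static_cast<std::uint8_t>(0xC0 | (cp >> 6)),
             static_cast<std::uint8_t>(0x80 | (cp & 0x3F)) };
}
```

For example, 0xE9 (é) is 11101001 in binary: the top two bits give 0xC0 | 0x03 = 0xC3 and the low six bits give 0x80 | 0x29 = 0xA9, so é is stored as C3 A9 in UTF-8.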
UTF-8 actually works quite well in std::string. Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.
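One thing that does not work out of the box is counting characters, because `size()` returns bytes. As a rough sketch of what self-synchronization gives you, continuation bytes always look like 10xxxxxx, so you can count code points by skipping them. The name `utf8_length` is made up here, and the function assumes the string already holds valid UTF-8:

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string by skipping continuation
// bytes, which always have the bit pattern 10xxxxxx.
// Hypothetical helper; assumes valid UTF-8 and performs no validation.
std::size_t utf8_length(const std::string &s) {
    std::size_t count = 0;
    for (unsigned char ch : s) {
        if ((ch & 0xC0) != 0x80) {  // not a continuation byte
            ++count;
        }
    }
    return count;
}
```

Searching for an ASCII delimiter with `find(',')` is safe too, since no byte of a multibyte sequence is ever below 0x80.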
If your source encoding will always be ISO-8859-1, this is trivial. Here's a loop:
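    /* in: NUL-terminated ISO-8859-1 input; out: output buffer (both must be set up by the caller) */
    unsigned char *in, *out;
    while (*in) {
        if (*in < 128) {
            *out++ = *in++;                   /* ASCII: copy unchanged */
        } else {
            *out++ = 0xc2 + (*in > 0xbf);     /* lead byte: 0xC2 or 0xC3 */
            *out++ = (*in++ & 0x3f) + 0x80;   /* continuation byte: 10xxxxxx */
        }
    }
    *out = '\0';                              /* terminate the output */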
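For safety you need to ensure that the output buffer is at least twice as large as the input (plus one byte if you NUL-terminate the output), since every input byte can expand to two output bytes. Otherwise, include a size limit and check it in the loop condition.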
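The size-limit variant could look roughly like this. It is only a sketch: the function name `latin1_to_utf8_bounded` and the convention of returning `(size_t)-1` on overflow are my own choices, not anything standard:

```cpp
#include <cstddef>

// Convert a NUL-terminated ISO-8859-1 string to UTF-8, writing at most
// out_size bytes (including the terminating NUL) into out.
// Returns the number of bytes written, excluding the NUL, or
// static_cast<std::size_t>(-1) if out is null or too small.
// Hypothetical helper; name and error convention are illustrative.
std::size_t latin1_to_utf8_bounded(const unsigned char *in,
                                   unsigned char *out,
                                   std::size_t out_size) {
    const std::size_t kError = static_cast<std::size_t>(-1);
    if (out == nullptr || out_size == 0) {
        return kError;
    }
    std::size_t written = 0;
    for (; *in != 0; ++in) {
        const std::size_t need = (*in < 0x80) ? 1 : 2;
        // Always keep one byte in reserve for the terminating NUL.
        if (written + need + 1 > out_size) {
            out[written] = 0;  // leave a valid, truncated string behind
            return kError;
        }
        if (*in < 0x80) {
            out[written++] = *in;  // ASCII: copy unchanged
        } else {
            out[written++] = static_cast<unsigned char>(0xC0 | (*in >> 6));
            out[written++] = static_cast<unsigned char>(0x80 | (*in & 0x3F));
        }
    }
    out[written] = 0;
    return written;
}
```

On overflow the output is cut at a character boundary, so the buffer never ends in half of a two-byte sequence.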
For C++ I use this:
    #include <cstdint>
    #include <string>

    std::string iso_8859_1_to_utf8(const std::string &str) {
        std::string strOut;
        for (std::string::const_iterator it = str.begin(); it != str.end(); ++it) {
            std::uint8_t ch = *it;
            if (ch < 0x80) {
                strOut.push_back(ch);                    // ASCII: copy unchanged
            } else {
                strOut.push_back(0xc0 | (ch >> 6));      // lead byte: 0xC2 or 0xC3
                strOut.push_back(0x80 | (ch & 0x3f));    // continuation byte
            }
        }
        return strOut;
    }