
How to convert (char *) from ISO-8859-1 to UTF-8 in C++ in a cross-platform way?

I'm modifying a C++ program, which processes text in ISO Latin 1 format, to store data in an SQLite database.
The problem is that SQLite works in UTF-8, and so do the Java modules that use the same database.

I want a way to convert the ISO Latin 1 characters to UTF-8 before storing them in the database. It needs to work on Windows and Mac.

I heard ICU would do that, but I think it's too bloated. I just need a simple conversion system (preferably back and forth) for these 2 charsets.

How would I do that?

gabriel asked Apr 07 '11 19:04


3 Answers

ISO-8859-1 was incorporated as the first 256 code points of ISO/IEC 10646 and Unicode. So the conversion is pretty simple.

for each char:

uint8_t ch = code_point; /* assume that code points above 0xff are impossible since latin-1 is 8-bit */

if(ch < 0x80) {
    append(ch);
} else {
    append(0xc0 | ((ch & 0xc0) >> 6)); /* first byte, simplified since our range is only 8 bits */
    append(0x80 | (ch & 0x3f));
}

See http://en.wikipedia.org/wiki/UTF-8#Description for more details.

EDIT: according to a comment by ninjalj, Latin-1 translates directly to the first 256 Unicode code points, so the above algorithm should work. For example, é (0xE9) becomes the two bytes 0xC3 0xA9.

Evan Teran answered Oct 05 '22 08:10


For C++ I use this:

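#include <cstdint>
#include <string>

// Convert an ISO-8859-1 string to UTF-8.
std::string iso_8859_1_to_utf8(const std::string &str)
{
    std::string strOut;
    for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
    {
        std::uint8_t ch = static_cast<std::uint8_t>(*it);
        if (ch < 0x80) {
            // ASCII range: copied unchanged
            strOut.push_back(static_cast<char>(ch));
        }
        else {
            // 0x80..0xFF: two-byte sequence 110000xx 10xxxxxx
            strOut.push_back(static_cast<char>(0xc0 | (ch >> 6)));
            strOut.push_back(static_cast<char>(0x80 | (ch & 0x3f)));
        }
    }
    return strOut;
}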
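
The question also asks for the reverse direction. A minimal sketch of it could look like the following; the function name utf8_to_iso_8859_1 and the choice to throw std::invalid_argument are assumptions made for illustration, not part of the answer above. It relies on the fact that U+0080..U+00FF are encoded with the lead bytes 0xC2 and 0xC3 only, so anything else above ASCII cannot be represented in Latin-1.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts UTF-8 back to ISO-8859-1.
// Throws std::invalid_argument if the input is malformed or contains a
// code point above U+00FF, which Latin-1 cannot represent.
std::string utf8_to_iso_8859_1(const std::string &str)
{
    std::string out;
    out.reserve(str.size());
    for (std::size_t i = 0; i < str.size(); ++i) {
        const std::uint8_t ch = static_cast<std::uint8_t>(str[i]);
        if (ch < 0x80) {
            // ASCII range: copied unchanged
            out.push_back(static_cast<char>(ch));
        } else if (ch == 0xc2 || ch == 0xc3) {
            // Lead bytes 0xC2 and 0xC3 cover exactly U+0080..U+00FF
            if (i + 1 >= str.size())
                throw std::invalid_argument("truncated UTF-8 sequence");
            const std::uint8_t next = static_cast<std::uint8_t>(str[++i]);
            if ((next & 0xc0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            // Recombine the two payload bits with the six continuation bits
            out.push_back(static_cast<char>(((ch & 0x03) << 6) | (next & 0x3f)));
        } else {
            throw std::invalid_argument("not representable in ISO-8859-1");
        }
    }
    return out;
}
```

Calling it on the bytes 0xC3 0xA9 gives back the single byte 0xE9. Whether to throw, skip, or substitute a placeholder such as '?' for unrepresentable characters is a design choice that depends on the application.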
Lord Raiden answered Oct 05 '22 08:10


If general-purpose charset frameworks (like iconv) are too bloated for you, roll your own.

Compose a static translation table (char to UTF-8 sequence) and write your own translation routine. Depending on what you use for string storage (char buffers, std::string, or something else) it would look somewhat different, but the idea is the same: scan through the source string and replace each character whose code is over 127 with its UTF-8 counterpart sequence. Since this can increase the string length, doing it in place would be rather inconvenient. For added benefit, you can do it in two passes: pass one determines the necessary target string size, pass two performs the translation.

Seva Alekseyev answered Oct 05 '22 07:10