
Convert ISO-8859-1 strings to UTF-8 in C/C++

You would think this would be readily available, but I'm having a hard time finding a simple library function that converts a C or C++ string from ISO-8859-1 encoding to UTF-8. I'm reading data in 8-bit ISO-8859-1 encoding, but I need to convert it to a UTF-8 string for use in an SQLite database and eventually an Android app.

I found one commercial product, but it's beyond my budget at this time.

gordonwd asked Oct 30 '10 17:10


People also ask

Is ISO 8859 the same as UTF-8?

UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.

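Does std::string support UTF-8?

UTF-8 actually works quite well in std::string. Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII. Note that std::string operates on bytes, so size() and indexing count bytes rather than characters.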
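As a rough illustration of the self-synchronizing property, continuation bytes always have the bit pattern 10xxxxxx, so code points can be counted by skipping them. The helper name `utf8_code_points` is hypothetical, and the sketch assumes the string already holds valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (name chosen for illustration): counts code points in
// a std::string that is assumed to contain valid UTF-8.
std::size_t utf8_code_points(const std::string &s) {
    // Sanity check: valid UTF-8 never starts with a continuation byte.
    assert(s.empty() || (static_cast<unsigned char>(s[0]) & 0xc0) != 0x80);
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xc0) != 0x80) {  // count every byte that is not 10xxxxxx
            ++n;
        }
    }
    return n;
}
```

For a string holding "a" followed by the two-byte sequence C3 A9, size() reports three bytes while this helper counts two code points.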


2 Answers

If your source encoding will always be ISO-8859-1, this is trivial. Here's a loop:

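    unsigned char *in, *out;  /* in: NUL-terminated ISO-8859-1 input; out: output buffer */
    while (*in)
        if (*in < 128)
            *out++ = *in++;                  /* ASCII: copy unchanged */
        else
            *out++ = 0xc2 + (*in > 0xbf),    /* lead byte: 0xC2 or 0xC3 */
            *out++ = (*in++ & 0x3f) + 0x80;  /* continuation byte: 10xxxxxx */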

For safety you need to ensure that the output buffer is at least twice as large as the input buffer, or else include a size limit and check it in the loop condition. Note that the loop does not write a terminating NUL to the output.
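The size-limit variant mentioned above might look like the following sketch. The function name `latin1_to_utf8_bounded`, its signature, and the `(size_t)-1` error return are assumptions made for illustration, not part of the answer.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: converts a NUL-terminated ISO-8859-1 string to UTF-8,
// writing at most out_size bytes including a terminating NUL.
// Returns the number of bytes written (excluding the NUL), or
// static_cast<std::size_t>(-1) if the output buffer is too small.
std::size_t latin1_to_utf8_bounded(const unsigned char *in,
                                   unsigned char *out, std::size_t out_size)
{
    assert(in != nullptr && out != nullptr);
    const std::size_t too_small = static_cast<std::size_t>(-1);
    std::size_t used = 0;
    while (*in) {
        const std::size_t need = (*in < 128) ? 1 : 2;
        // Keep one byte in reserve for the terminating NUL.
        if (used + need + 1 > out_size) {
            return too_small;
        }
        if (*in < 128) {
            out[used++] = *in++;  // ASCII: copy unchanged
        } else {
            out[used++] = static_cast<unsigned char>(0xc2 + (*in > 0xbf));
            out[used++] = static_cast<unsigned char>((*in++ & 0x3f) + 0x80);
        }
    }
    if (used + 1 > out_size) {
        return too_small;  // no room even for the NUL (e.g. out_size == 0)
    }
    out[used] = 0;
    return used;
}
```

Returning an error instead of truncating avoids cutting a two-byte sequence in half; a caller that prefers truncation could stop at the last complete character instead.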

R.. GitHub STOP HELPING ICE answered Sep 21 '22 11:09


For C++ I use this:

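    #include <cstdint>
    #include <string>

    std::string iso_8859_1_to_utf8(const std::string &str) {
        std::string strOut;
        strOut.reserve(str.size() * 2);  // worst case: every byte becomes two
        for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
        {
            uint8_t ch = *it;
            if (ch < 0x80) {
                // ASCII: copy unchanged
                strOut.push_back(static_cast<char>(ch));
            }
            else {
                // 0x80..0xFF: two bytes, 110000xx 10xxxxxx
                strOut.push_back(static_cast<char>(0xc0 | (ch >> 6)));
                strOut.push_back(static_cast<char>(0x80 | (ch & 0x3f)));
            }
        }
        return strOut;
    }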
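The two answers compute the output bytes with different arithmetic (addition in the C loop, shifts and ORs here), but the results should be identical for every byte from 0x80 to 0xFF. The sketch below checks that; the names `encode_add`, `encode_shift`, and `formulas_agree` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

using BytePair = std::pair<std::uint8_t, std::uint8_t>;

// Lead and continuation byte using the arithmetic from the C loop.
// Only meaningful for non-ASCII input (ch >= 0x80).
inline BytePair encode_add(std::uint8_t ch) {
    assert(ch >= 0x80);
    return BytePair(static_cast<std::uint8_t>(0xc2 + (ch > 0xbf)),
                    static_cast<std::uint8_t>((ch & 0x3f) + 0x80));
}

// Lead and continuation byte using the shift/OR form from the C++ answer.
inline BytePair encode_shift(std::uint8_t ch) {
    assert(ch >= 0x80);
    return BytePair(static_cast<std::uint8_t>(0xc0 | (ch >> 6)),
                    static_cast<std::uint8_t>(0x80 | (ch & 0x3f)));
}

// Returns true if both forms agree for every byte in 0x80..0xFF.
inline bool formulas_agree() {
    for (int c = 0x80; c <= 0xff; ++c) {
        const std::uint8_t ch = static_cast<std::uint8_t>(c);
        if (encode_add(ch) != encode_shift(ch)) {
            return false;
        }
    }
    return true;
}
```

As a worked case, 0xE9 (é) gives a lead byte of 0xC0 | (0xE9 >> 6) = 0xC3 and a continuation byte of 0x80 | (0xE9 & 0x3F) = 0xA9, which is the UTF-8 sequence C3 A9.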
Lord Raiden answered Sep 19 '22 11:09