I have a string that is filled with data from another program, and this data may or may not be UTF-8 encoded. If it is not, I can encode it to UTF-8, but what is the best way to detect UTF-8 in C++? I saw this variant https://stackoverflow.com/questions/... but there are comments saying that this solution does not give 100% reliable detection. And if I encode a string that already contains UTF-8 data to UTF-8 again, I end up writing wrong text to the database.
So can I just use this UTF-8 detection:
bool is_utf8(const char * string)
{
    if(!string)
        return 0;

    const unsigned char * bytes = (const unsigned char *)string;
    while(*bytes)
    {
        if( (// ASCII
             // use bytes[0] <= 0x7F to allow ASCII control characters
             bytes[0] == 0x09 ||
             bytes[0] == 0x0A ||
             bytes[0] == 0x0D ||
             (0x20 <= bytes[0] && bytes[0] <= 0x7E)
            )
          ) {
            bytes += 1;
            continue;
        }

        if( (// non-overlong 2-byte
             (0xC2 <= bytes[0] && bytes[0] <= 0xDF) &&
             (0x80 <= bytes[1] && bytes[1] <= 0xBF)
            )
          ) {
            bytes += 2;
            continue;
        }

        if( (// excluding overlongs
             bytes[0] == 0xE0 &&
             (0xA0 <= bytes[1] && bytes[1] <= 0xBF) &&
             (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// straight 3-byte
             ((0xE1 <= bytes[0] && bytes[0] <= 0xEC) ||
              bytes[0] == 0xEE ||
              bytes[0] == 0xEF) &&
             (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
             (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// excluding surrogates
             bytes[0] == 0xED &&
             (0x80 <= bytes[1] && bytes[1] <= 0x9F) &&
             (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            )
          ) {
            bytes += 3;
            continue;
        }

        if( (// planes 1-3
             bytes[0] == 0xF0 &&
             (0x90 <= bytes[1] && bytes[1] <= 0xBF) &&
             (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
             (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// planes 4-15
             (0xF1 <= bytes[0] && bytes[0] <= 0xF3) &&
             (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
             (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
             (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// plane 16
             bytes[0] == 0xF4 &&
             (0x80 <= bytes[1] && bytes[1] <= 0x8F) &&
             (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
             (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            )
          ) {
            bytes += 4;
            continue;
        }

        return 0;
    }
    return 1;
}
And this is the code I use to encode to UTF-8 when the detection returns false:
std::string text; // filled with the data from the other program
if(!is_utf8(text.c_str()))
{
    int size = MultiByteToWideChar(CP_ACP, MB_COMPOSITE, text.c_str(),
                                   (int)text.length(), 0, 0);
    std::wstring utf16_str(size, '\0');
    MultiByteToWideChar(CP_ACP, MB_COMPOSITE, text.c_str(),
                        (int)text.length(), &utf16_str[0], size);
    int utf8_size = WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
                                        (int)utf16_str.length(), 0, 0, 0, 0);
    std::string utf8_str(utf8_size, '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
                        (int)utf16_str.length(), &utf8_str[0], utf8_size, 0, 0);
    text = utf8_str;
}
Or is the code above not done properly? Also, I am doing this on Windows 7. What about Ubuntu, does this approach work there too?
Comparing whole byte values is not the correct way to detect UTF-8. You have to analyze the actual bit patterns of each byte. UTF-8 uses a very distinct bit pattern that no other encoding uses. Try something more like this instead:
bool is_utf8(const char * string)
{
    if (!string)
        return true;

    const unsigned char * bytes = (const unsigned char *)string;
    int num;

    while (*bytes != 0x00)
    {
        if ((*bytes & 0x80) == 0x00)
        {
            // U+0000 to U+007F
            num = 1;
        }
        else if ((*bytes & 0xE0) == 0xC0)
        {
            // U+0080 to U+07FF
            num = 2;
        }
        else if ((*bytes & 0xF0) == 0xE0)
        {
            // U+0800 to U+FFFF
            num = 3;
        }
        else if ((*bytes & 0xF8) == 0xF0)
        {
            // U+10000 to U+10FFFF
            num = 4;
        }
        else
            return false;

        bytes += 1;
        for (int i = 1; i < num; ++i)
        {
            if ((*bytes & 0xC0) != 0x80)
                return false;
            bytes += 1;
        }
    }

    return true;
}
Now, this does not take into account illegal UTF-8 sequences, such as overlong encodings, UTF-16 surrogates, and codepoints above U+10FFFF. If you want to make sure the UTF-8 is both valid and correct, you would need something more like this:
bool is_valid_utf8(const char * string)
{
    if (!string)
        return true;

    const unsigned char * bytes = (const unsigned char *)string;
    unsigned int cp;
    int num;

    while (*bytes != 0x00)
    {
        if ((*bytes & 0x80) == 0x00)
        {
            // U+0000 to U+007F
            cp = (*bytes & 0x7F);
            num = 1;
        }
        else if ((*bytes & 0xE0) == 0xC0)
        {
            // U+0080 to U+07FF
            cp = (*bytes & 0x1F);
            num = 2;
        }
        else if ((*bytes & 0xF0) == 0xE0)
        {
            // U+0800 to U+FFFF
            cp = (*bytes & 0x0F);
            num = 3;
        }
        else if ((*bytes & 0xF8) == 0xF0)
        {
            // U+10000 to U+10FFFF
            cp = (*bytes & 0x07);
            num = 4;
        }
        else
            return false;

        bytes += 1;
        for (int i = 1; i < num; ++i)
        {
            if ((*bytes & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (*bytes & 0x3F);
            bytes += 1;
        }

        if ((cp > 0x10FFFF) ||
            ((cp >= 0xD800) && (cp <= 0xDFFF)) ||
            ((cp <= 0x007F) && (num != 1)) ||
            ((cp >= 0x0080) && (cp <= 0x07FF) && (num != 2)) ||
            ((cp >= 0x0800) && (cp <= 0xFFFF) && (num != 3)) ||
            ((cp >= 0x10000) && (cp <= 0x1FFFFF) && (num != 4)))
            return false;
    }

    return true;
}
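As a small usage illustration (not part of the original answer, assuming the is_valid_utf8 function above is in scope), the stricter check accepts well-formed sequences and rejects the malformed ones mentioned above:
#include <cstdio>

int main()
{
    std::printf("%d\n", is_valid_utf8("Hello"));        // 1: plain ASCII is valid UTF-8
    std::printf("%d\n", is_valid_utf8("\xC3\xA9"));     // 1: well-formed 2-byte sequence (U+00E9)
    std::printf("%d\n", is_valid_utf8("\xE0\x80\x80")); // 0: overlong encoding is rejected
    std::printf("%d\n", is_valid_utf8("\xED\xA0\x80")); // 0: UTF-16 surrogate (U+D800) is rejected
    return 0;
}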
You probably don't understand UTF-8 and the alternatives. There are only 256 possible values for a byte. That's not a lot, given the number of characters. As a result, many byte sequences are both valid UTF-8 strings and valid strings in other encodings.
In fact, every ASCII string is intentionally a valid UTF-8 string with essentially the same meaning. Your code would return true for is_utf8("Hello").
Even many other non-UTF-8, non-ASCII strings share byte sequences with valid UTF-8 strings. And there's just no way to convert a non-UTF-8 string to UTF-8 without knowing exactly what kind of non-UTF-8 encoding it is. Even Latin-1 and Latin-2 are already quite different. CP_ACP is even worse than Latin-1: CP_ACP isn't even the same everywhere.
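To make that overlap concrete, here is a small illustration (my example, not from the answer): the two bytes 0xC3 0xA9 are "é" when read as UTF-8 but "Ã©" when read as Latin-1/Windows-1252, so a byte-level check accepts both readings and cannot tell which one was intended. Either is_utf8 shown earlier returns true for it:
#include <cstdio>

int main()
{
    // Identical bytes, two legitimate interpretations:
    // UTF-8 "é" (U+00E9) or Latin-1 "Ã©".
    const char ambiguous[] = "\xC3\xA9";
    std::printf("%d\n", is_utf8(ambiguous)); // 1
    std::printf("%d\n", is_utf8("Hello"));   // 1: every ASCII string is valid UTF-8
    return 0;
}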
Your text must go into the database as UTF-8. Thus, if it isn't yet UTF-8, it must be converted, and you must know the exact source encoding. There is no magical escape.
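If you do know the source code page on Windows, the conversion looks like the code already posted in the question, just with the real code page instead of CP_ACP. A sketch under that assumption (the code page 1252 below is only an example placeholder):
#include <windows.h>
#include <string>

// Sketch: convert from a *known* source code page to UTF-8.
// Pass the code page the producing program really uses, e.g. 1252.
std::string to_utf8_from_codepage(const std::string & input, UINT codepage)
{
    if (input.empty())
        return std::string();

    // First hop: source code page -> UTF-16
    int wide_size = MultiByteToWideChar(codepage, 0, input.c_str(),
                                        (int)input.length(), NULL, 0);
    std::wstring utf16(wide_size, L'\0');
    MultiByteToWideChar(codepage, 0, input.c_str(),
                        (int)input.length(), &utf16[0], wide_size);

    // Second hop: UTF-16 -> UTF-8
    int utf8_size = WideCharToMultiByte(CP_UTF8, 0, utf16.c_str(),
                                        (int)utf16.length(), NULL, 0, NULL, NULL);
    std::string utf8(utf8_size, '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16.c_str(),
                        (int)utf16.length(), &utf8[0], utf8_size, NULL, NULL);
    return utf8;
}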
On Linux, iconv is the usual method to convert between two encodings.
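A minimal iconv sketch for Linux, again assuming the source encoding is known (the "WINDOWS-1251" name below is only an example placeholder, not something the answer prescribes):
#include <iconv.h>
#include <string>
#include <stdexcept>
#include <cerrno>

// Sketch: convert 'input' from a known source encoding to UTF-8 using iconv.
std::string to_utf8(const std::string & input, const char * from_encoding /* e.g. "WINDOWS-1251" */)
{
    iconv_t cd = iconv_open("UTF-8", from_encoding);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string output;
    std::string in = input;   // iconv wants a writable input pointer
    char * in_ptr = &in[0];
    size_t in_left = in.size();

    char buffer[1024];
    while (in_left > 0)
    {
        char * out_ptr = buffer;
        size_t out_left = sizeof(buffer);
        size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        if (rc == (size_t)-1 && errno != E2BIG)   // E2BIG just means "flush and continue"
        {
            iconv_close(cd);
            throw std::runtime_error("iconv conversion failed");
        }
        output.append(buffer, sizeof(buffer) - out_left);
    }

    iconv_close(cd);
    return output;
}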