Looking at the Unicode standard, it recommends using plain chars for storing UTF-8 encoded strings. Does this work as expected with C++ and the basic std::string, or do cases exist in which the UTF-8 encoding can create problems?
For example, when computing the length, it may not be identical to the number of bytes. How is this supposed to be handled? Reading the standard, I'm probably fine using a char array for storage, but I'll still need to write functions like strlen etc. on my own that work on encoded text, because as far as I understand the problem, the standard routines are either ASCII only or expect wide literals (16 bits or more), which are not recommended by the Unicode standard. So far, the best source I found about encoding is a post on Joel on Software, but it does not explain what we poor C++ developers should use :)
There's a library called "UTF8-CPP", which lets you store your UTF-8 strings in standard std::string objects and provides additional functions to enumerate and manipulate UTF-8 characters.
I haven't tested it yet, so I don't know what it's worth, but I am considering using it myself.
strlen counts the number of non-null chars before the first \0. In UTF-8, that count is a sane number (the number of bytes used), but it is not the number of characters (one UTF-8 character takes 1 to 4 chars). basic_string doesn't rely on a terminating \0 to find the end, but it too keeps a byte count, so size() and length() return bytes, not characters.
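To make the difference between bytes and characters concrete, here is a minimal sketch of a code point counter. The name count_code_points is made up for illustration, and the function assumes its input is already valid UTF-8. It simply counts every byte that is not a continuation byte (10xxxxxx).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by counting
// every byte that is NOT a continuation byte (bit pattern 10xxxxxx).
// Assumes `s` is valid UTF-8; malformed input is not detected.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {  // lead byte or plain ASCII
            ++n;
        }
    }
    return n;
}
```

For the string "Привет" stored as UTF-8, size() reports 12 bytes while this function reports 6. Note that this counts code points, which is still not the same as user-perceived characters once combining marks are involved.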
strcpy or the basic_string copy ctor copy all bytes without looking too closely.
Finding a substring works OK because of the way UTF-8 is encoded. The allowed values for the first byte of a character are distinct from those for the second to fourth bytes (the former never starts with 10xxxxxx, the latter always do), so a match for a valid UTF-8 needle cannot begin in the middle of a character.
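As a rough illustration of why a plain byte search is enough, the sketch below wraps std::string::find and adds a lead-byte test. Both is_lead_byte and find_utf8 are hypothetical names, and the sketch assumes that haystack and needle are both valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true for bytes that can start a character,
// false for continuation bytes (bit pattern 10xxxxxx).
bool is_lead_byte(unsigned char c) {
    return (c & 0xC0) != 0x80;
}

// Hypothetical wrapper: an ordinary byte-wise search. With valid UTF-8 on
// both sides, the returned byte offset always lands on a lead byte.
// Returns std::string::npos when the needle is not present.
std::size_t find_utf8(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle);
}
```

The result is a byte offset, not a character index, so it can be passed straight to substr() but should not be shown to a user as a position.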
Taking a substring is tricky: how do you specify the position? If the begin and end were found by searching for ASCII text markers (e.g. [ and ]) then there's no problem. You'd just get the bytes in the middle, which are a valid UTF-8 string too. You can't hardcode positions, or even relative offsets, though. Even a relative offset of +1 character can be hard; how many bytes is that? You will end up writing a function like SkipOneChar.
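A function in the spirit of SkipOneChar could look roughly like this. The name skip_one_char and its signature are my own choice, and the sketch assumes valid UTF-8 with pos pointing at the first byte of a character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte index of the character that follows
// the one starting at `pos`. Assumes `s` is valid UTF-8 and `pos` is on a
// lead byte. Returns s.size() when there is no further character.
std::size_t skip_one_char(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) {
        return s.size();
    }
    ++pos;  // step past the lead byte
    // step past any continuation bytes (10xxxxxx)
    while (pos < s.size() &&
           (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80) {
        ++pos;
    }
    return pos;
}
```

With the UTF-8 bytes of "[П]", starting at byte 1 (the П) returns 3, because П occupies two bytes.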
An example with the ICU library (C, C++, Java):
    #include <iostream>
    #include <unicode/unistr.h> // using ICU library
    int main(int argc, char *argv[]) {
        // constructing a Unicode string
        // (newer ICU releases require the icu:: namespace qualifier)
        icu::UnicodeString ustr1("Привет"); // using platform's default codepage
        // length() counts UTF-16 code units; for this string that is 6,
        // the same as the number of characters
        int ulen1 = ustr1.length();
        // extracting encoded characters from a string
        int const bufsize = 25;
        char encoded[bufsize];
        ustr1.extract(0, ulen1, encoded, bufsize, "UTF-8"); // forced UTF-8 encoding
        // printing the result
        std::cout << "Length of " << encoded << " is " << ulen1 << "\n";
        return 0;
    }
Building (the library goes after the source file so the linker can resolve its symbols):
$ g++ -o icu-example{,.cc} -licuuc
Running:
$ ./icu-example
Length of Привет is 6
Works for me on Linux with GCC 4.3.2 and libicu 3.8.1. Please note that it prints in UTF-8 no matter what the system locale is. You won't see it correctly if yours is not UTF-8.