My platform is a Mac. I'm a C++ beginner and working on a personal project which processes Chinese and English. UTF-8 is the preferred encoding for this project.
I read some posts on Stack Overflow, and many of them suggest using std::string when dealing with UTF-8 and avoiding wchar_t, as there is no char8_t for UTF-8 right now.
However, none of them talk about how to properly deal with functions like str[i], std::string::size(), std::string::find_first_of() or std::regex, as these functions usually return unexpected results when facing UTF-8.
Should I go ahead with std::string or switch to std::wstring? If I should stay with std::string, what is the best practice for handling the above problems?
@MSalters: std::string can hold 100% of all Unicode characters, even if CHAR_BIT is 8. It depends on the encoding used for std::string, which may be UTF-8 at the system level (like almost everywhere except Windows) or at your application level.
std::string doesn't have the concept of encodings. It just stores whatever bytes are passed to it; for example, std::cout << "è"; simply writes the literal's bytes to the stream.
Unicode is a vast and complex topic. I do not wish to wade too deep there, however a quick glossary is necessary:

- Code Point: the atomic unit of Unicode, identified as U+XXXX (for example, U+54C8 for 哈).
- Code Unit: the unit of storage of a given encoding; 8 bits for UTF-8, 16 bits for UTF-16, 32 bits for UTF-32.
- Grapheme Cluster: what a user perceives as a single "character", possibly composed of several Code Points (for example, a letter followed by a combining diacritic).

These are the basics of Unicode. The distinction between Code Point and Grapheme Cluster can mostly be glossed over, because for most modern languages each "character" maps to a single Code Point (there are dedicated precomposed forms for commonly used letter+diacritic combinations). Still, if you venture into emoji, flags, etc., then you may have to pay attention to the distinction.
Then, a series of Unicode Code Points has to be encoded; the common encodings are UTF-8, UTF-16 and UTF-32, the latter two existing in both Little-Endian and Big-Endian forms, for a total of 5 common encodings. In UTF-X, X is the size in bits of a Code Unit, and each Code Point is represented as one or several Code Units, depending on its magnitude.
The candidate in-memory representations are std::string and std::wstring.

- Avoid std::wstring if you care about portability (wchar_t is only 16 bits on Windows); use std::u32string instead (aka std::basic_string<char32_t>).
- The in-memory representation (std::string or std::wstring) is independent of the on-disk representation (UTF-8, UTF-16 or UTF-32), so prepare yourself for having to convert at the boundary (reading and writing).
- While a 32-bit wchar_t ensures that a Code Unit represents a full Code Point, it still does not represent a complete Grapheme Cluster.

If you are only reading or composing strings, you should have little to no issue with std::string or std::wstring.
Troubles start when you start slicing and dicing, then you have to pay attention to (1) Code Point boundaries (in UTF-8 or UTF-16) and (2) Grapheme Clusters boundaries. The former can be handled easily enough on your own, the latter requires using a Unicode aware library.
std::string or std::u32string?

If performance is a concern, it is likely that std::string will perform better due to its smaller memory footprint; though heavy use of Chinese may change the deal. As always, profile.

If Grapheme Clusters are not a problem, then std::u32string has the advantage of simplifying things: 1 Code Unit -> 1 Code Point means that you cannot accidentally split Code Points, and all the functions of std::basic_string work out of the box.
If you interface with software taking std::string or char*/char const*, then stick to std::string to avoid back-and-forth conversions. It'll be a pain otherwise.
When in doubt, use std::string: UTF-8 actually works quite well in std::string.

Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII. Due to the way Code Points are encoded, looking for a Code Point cannot accidentally match the middle of another Code Point:
- str.find('\n') works,
- str.find("...") works for matching byte by byte¹,
- str.find_first_of("\r\n") works if searching for ASCII characters.

Similarly, std::regex should mostly work out of the box. Since a sequence of characters (such as "哈哈") is just a sequence of bytes, basic search patterns should work out of the box.
Be wary, however, of character classes (such as [[:alnum:]]): depending on the regex flavor and implementation, they may or may not match Unicode characters. Similarly, be wary of applying repeaters to non-ASCII "characters": "哈?" may consider only the last byte to be optional; use parentheses to clearly delineate the repeated sequence of bytes in such cases: "(哈)?".
¹ The key concepts to look up are normalization and collation; these affect all comparison operations. std::string will always compare (and thus sort) byte by byte, without regard for comparison rules specific to a language or a usage. If you need to handle full normalization/collation, you need a complete Unicode library, such as ICU.