How do I properly use std::string on UTF-8 in C++?

Tags: c++, string, c++11

My platform is a Mac. I'm a C++ beginner and working on a personal project which processes Chinese and English. UTF-8 is the preferred encoding for this project.

I read some posts on Stack Overflow, and many of them suggest using std::string when dealing with UTF-8 and avoiding wchar_t, as there's no char8_t right now for UTF-8.

However, none of them talk about how to properly deal with str[i], std::string::size(), std::string::find_first_of() or std::regex, as these usually return unexpected results when facing UTF-8.

Should I go ahead with std::string or switch to std::wstring? If I should stay with std::string, what's the best practice for one to handle the above problems?

asked May 18 '18 03:05

Saddle Point

1 Answer

Unicode Glossary

Unicode is a vast and complex topic. I do not wish to wade too deep there; however, a quick glossary is necessary:

Code Points: Code Points are the basic building blocks of Unicode; a code point is just an integer mapped to a meaning. The integer fits into 32 bits (well, 21 bits really, since the largest Code Point is U+10FFFF), and the meaning can be a letter, a diacritic, a white space, a sign, a smiley, half a flag, ... and it can even be "the next portion reads right to left".
Grapheme Clusters: Grapheme Clusters are groups of semantically related Code Points. For example, a flag in Unicode is represented by associating two Code Points; each of those two, in isolation, has no meaning, but associated together in a Grapheme Cluster they represent a flag. Grapheme Clusters are also used to pair a letter with a diacritic in some scripts.

These are the basics of Unicode. The distinction between Code Point and Grapheme Cluster can mostly be glossed over because, for most modern languages, each "character" is mapped to a single Code Point (there are dedicated accented forms for commonly used letter+diacritic combinations). Still, if you venture into smileys, flags, etc., then you may have to pay attention to the distinction.

UTF Primer

Then, a series of Unicode Code Points has to be encoded; the common encodings are UTF-8, UTF-16 and UTF-32, the latter two existing in both Little-Endian and Big-Endian forms, for a total of 5 common encodings.

In UTF-X, X is the size in bits of the Code Unit; each Code Point is represented as one or several Code Units, depending on its magnitude:

UTF-8: 1 to 4 Code Units,
UTF-16: 1 or 2 Code Units,
UTF-32: 1 Code Unit.
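
The Code Unit counts above follow directly from the magnitude of the Code Point. A minimal sketch, assuming the input is a valid Code Point (at most U+10FFFF, not a surrogate); the function names are hypothetical helpers, not standard library functions:

```cpp
#include <cstddef>

// Number of 8-bit Code Units UTF-8 needs for one Code Point.
// Assumption: cp is a valid Code Point (<= 0x10FFFF, not a surrogate).
inline std::size_t utf8_units(char32_t cp) {
    if (cp < 0x80) return 1;      // ASCII
    if (cp < 0x800) return 2;     // e.g. accented Latin letters
    if (cp < 0x10000) return 3;   // rest of the BMP, including most Chinese characters
    return 4;                     // everything above the BMP, e.g. most emoji
}

// Number of 16-bit Code Units UTF-16 needs for one Code Point.
inline std::size_t utf16_units(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;  // above the BMP a surrogate pair is required
}

// UTF-32 always uses exactly one 32-bit Code Unit.
inline std::size_t utf32_units(char32_t) {
    return 1;
}
```

For instance, 哈 (U+54C8) takes 3 Code Units in UTF-8 and 1 in UTF-16, while U+1F600 takes 4 and 2 respectively.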

`std::string` and `std::wstring`.

Do not use std::wstring if you care about portability (wchar_t is only 16 bits on Windows); use std::u32string instead (aka std::basic_string<char32_t>).
The in-memory representation (std::string or std::wstring) is independent of the on-disk representation (UTF-8, UTF-16 or UTF-32), so prepare yourself for having to convert at the boundary (reading and writing).
While a 32-bit wchar_t ensures that a Code Unit represents a full Code Point, it still does not represent a complete Grapheme Cluster.

If you are only reading or composing strings, you should have few to no issues with std::string or std::wstring.

Troubles start when you start slicing and dicing; then you have to pay attention to (1) Code Point boundaries (in UTF-8 or UTF-16) and (2) Grapheme Cluster boundaries. The former can be handled easily enough on your own; the latter requires using a Unicode-aware library.
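
To illustrate the conversion at the boundary, here is a minimal sketch of decoding UTF-8 bytes into a std::u32string. The name decode_utf8 is a hypothetical helper, and the validation is deliberately incomplete: it checks lead and continuation bytes but does not reject overlong encodings or surrogates, so a vetted library is a better choice for untrusted input.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode a UTF-8 byte sequence into Code Points (hypothetical helper).
// Throws std::invalid_argument on a bad lead byte, a bad continuation
// byte, or a truncated sequence. Overlong forms are NOT rejected.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```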
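
Handling Code Point boundaries in UTF-8 by hand can look like the sketch below. It relies on the fact that continuation bytes always have the bit pattern 10xxxxxx; the helper names are hypothetical and the input is assumed to be well-formed UTF-8. It does nothing about Grapheme Clusters.

```cpp
#include <cstddef>
#include <string>

// A continuation byte has the form 10xxxxxx.
inline bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Count Code Points by counting every byte that starts a sequence.
// Assumption: s is well-formed UTF-8.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) {
        if (!is_continuation(c)) ++n;
    }
    return n;
}

// Truncate to at most max_bytes bytes without cutting a Code Point in half:
// back up from the cut position until it lands on the start of a sequence.
inline std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t end = max_bytes;
    while (end > 0 && is_continuation(s[end])) --end;
    return s.substr(0, end);
}
```

With the 5-byte string "a哈b", count_code_points returns 3 whereas size() returns 5, and truncating to 2 or 3 bytes yields "a" rather than a broken sequence.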

Picking `std::string` or `std::u32string`?

If performance is a concern, std::string will likely perform better thanks to its smaller memory footprint, though heavy use of Chinese narrows the gap: most Chinese characters take 3 bytes in UTF-8 versus 4 in UTF-32. As always, profile.

If Grapheme Clusters are not a problem, then std::u32string has the advantage of simplifying things: 1 Code Unit -> 1 Code Point means that you cannot accidentally split Code Points, and all the functions of std::basic_string work out of the box.
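
A short sketch of what "out of the box" means with std::u32string, and of where Grapheme Clusters still bite. The helper take_code_points and the constant kFlag are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <string>

// With UTF-32, taking the first n Code Points is a plain substr:
// no boundary scan is needed because 1 Code Unit == 1 Code Point.
inline std::u32string take_code_points(const std::u32string& s, std::size_t n) {
    return s.substr(0, n);
}

// Two regional indicator Code Points (F and R) that together form
// a single Grapheme Cluster, rendered as one flag.
const std::u32string kFlag = U"\U0001F1EB\U0001F1F7";
```

Here size(), operator[] and find all operate on whole Code Points, yet kFlag.size() is 2 and take_code_points(kFlag, 1) leaves half a flag, so a Unicode-aware library is still needed when Grapheme Clusters matter.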

If you interface with software taking std::string or char*/char const*, then stick to std::string to avoid back-and-forth conversions. It'll be a pain otherwise.

UTF-8 in `std::string`.

UTF-8 actually works quite well in std::string.

Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.

Due to the way Code Points are encoded, looking for a Code Point cannot accidentally match the middle of another Code Point:

str.find('\n') works,
str.find("...") works for matching byte by byte¹,
str.find_first_of("\r\n") works if searching for ASCII characters.
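
As an illustration of the searches above, here is a minimal sketch that splits UTF-8 text into lines using find_first_of with ASCII delimiters. split_lines is a hypothetical helper; it treats "\r\n" as a single break, and a trailing line break produces an empty final line.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split UTF-8 text on "\n", "\r" or "\r\n". Safe on multi-byte sequences
// because no byte of a multi-byte Code Point can equal an ASCII character.
std::vector<std::string> split_lines(const std::string& text) {
    std::vector<std::string> lines;
    std::size_t start = 0;
    while (true) {
        std::size_t end = text.find_first_of("\r\n", start);
        if (end == std::string::npos) end = text.size();
        lines.push_back(text.substr(start, end - start));
        if (end == text.size()) break;
        // Skip the delimiter, consuming "\r\n" as one line break.
        const bool crlf = text[end] == '\r' && end + 1 < text.size()
                          && text[end + 1] == '\n';
        start = end + (crlf ? 2 : 1);
    }
    return lines;
}
```

Note that the guarantee only covers ASCII sets: find_first_of with a non-ASCII character in the set compares individual bytes, so searching for "哈" this way also stops at any other character that shares one of its bytes. Use find with the full byte sequence instead.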

Similarly, regex should mostly work out of the box. As a sequence of characters (whether "haha" or "哈") is just a sequence of bytes, basic search patterns should work out of the box.

Be wary, however, of character classes (such as [:alnum:]), as depending on the regex flavor and implementation they may or may not match Unicode characters.

Similarly, be wary of applying repeaters to non-ASCII "characters": "哈?" may only consider the last byte to be optional. Use parentheses to clearly delineate the repeated sequence of bytes in such cases: "(哈)?".
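
The repeater pitfall can be reproduced with std::regex, which matches char by char and therefore byte by byte on UTF-8. A minimal sketch with hypothetical helper names; 哈 is written as its UTF-8 bytes E5 93 88 so that the result does not depend on the source file encoding.

```cpp
#include <regex>
#include <string>

// "a", then an optional 哈 (bytes E5 93 88), then "b".
// The literal is split before "b" so the hex escape does not absorb it.
inline bool matches_grouped(const std::string& s) {
    static const std::regex re("a(\xE5\x93\x88)?" "b");
    return std::regex_match(s, re);
}

// Same pattern without the group: '?' now applies to the last byte only,
// so the first two bytes of 哈 become mandatory.
inline bool matches_ungrouped(const std::string& s) {
    static const std::regex re("a\xE5\x93\x88?" "b");
    return std::regex_match(s, re);
}
```

The grouped pattern accepts "ab" and "a哈b"; the ungrouped one rejects "ab" and instead accepts the malformed input "a", E5, 93, "b".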

¹The key concepts to look-up are normalization and collation; this affects all comparison operations. std::string will always compare (and thus sort) byte by byte, without regard for comparison rules specific to a language or a usage. If you need to handle full normalization/collation, you need a complete Unicode library, such as ICU.
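
To make the footnote concrete, the sketch below holds two encodings of the same visible text "é" that std::string treats as different, since it compares bytes. The constant names are hypothetical.

```cpp
#include <string>

// Precomposed form: the single Code Point U+00E9, UTF-8 bytes C3 A9.
const std::string kPrecomposed = "\xC3\xA9";

// Decomposed form: 'e' (U+0065) followed by the combining acute accent
// U+0301, UTF-8 bytes 65 CC 81.
const std::string kDecomposed = "e\xCC\x81";
```

kPrecomposed == kDecomposed is false and find will not locate one inside text that uses the other; normalizing both sides to the same form with a Unicode library before comparing avoids this. Sorting has the same byte-wise character: "Z" < "a" is true for std::string.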

answered Oct 05 '22 23:10

Matthieu M.
