Strings and character encoding in C++

I've read a few posts about best practices for strings and character encoding in C++, but I'm struggling to find a general-purpose approach that seems reasonably simple and correct. Could I ask for comments on the following? I'm inclined to use UTF-8 and UTF-32, and to define something like:

#include <cstdint>
#include <string>
typedef std::string string8;
typedef std::basic_string<std::uint32_t> string32;

The string8 class would be used for UTF-8, and having a separate type is just a reminder of the encoding. An alternative would be for string8 to be a subclass of std::string and to remove the methods that aren't quite right for UTF-8.

The string32 class would be used for UTF-32 when a fixed character size is desired.

The UTF-8 CPP functions, utf8::utf8to32() and utf8::utf32to8(), or even simpler wrapper functions, would be used to convert between the two.
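
As a rough illustration of what such a conversion involves, here is a minimal hand-written decoder from UTF-8 to UTF-32. It is only a sketch and not the UTF8-CPP implementation: the name to_utf32 is hypothetical, std::u32string (C++11) stands in for string32, and overlong forms and encoded surrogates are not rejected.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of UTF8-CPP): decode UTF-8 bytes into
// UTF-32 code points. Assumes std::u32string in place of string32.
// Checks sequence structure only; overlong forms are not rejected.
std::u32string to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte gives the sequence length and the top payload bits.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

In real code a tested library would replace this; the sketch only shows that the conversion itself is mechanical.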

637

asked Oct 16 '10 20:10

nassar

1 Answer

If you plan on just passing strings around and never inspecting them, you can use plain std::string, though it's a poor man's solution.

The issue is that most frameworks, even the standard, have stupidly (I think) enforced an encoding in memory. I say stupid because encoding should only matter at the interface, and those encodings are not suited to in-memory manipulation of the data.

Furthermore, encoding is easy (it's a simple mapping from code point to bytes and back), while the main difficulty is actually manipulating the data.
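
To make the "code point to bytes" direction concrete, here is a minimal sketch of a UTF-8 encoder for a single code point. The function name append_utf8 is hypothetical and not taken from any library; the bit layout is the standard UTF-8 one.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point to `out`.
void append_utf8(std::string& out, char32_t cp) {
    // Surrogates and values above U+10FFFF are not Unicode scalar values.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");

    if (cp < 0x80) {                 // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

For example, appending U+00E9 (é) yields the two bytes C3 A9, and U+20AC (€) yields E2 82 AC.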

With an 8-bit or 16-bit encoding you run the risk of cutting a character in the middle, because neither std::string nor std::wstring is aware of what a Unicode character is. Worse, even with a 32-bit encoding, there is the risk of separating a character from the diacritics that apply to it, which is also stupid.
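
The first risk can be sketched with plain std::string. The helper below, with the hypothetical name truncate_utf8, assumes its input is valid UTF-8 and backs up to a code point boundary before cutting, which a naive substr() does not do.

```cpp
#include <cstddef>
#include <string>

// True if `b` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: keep at most `max_bytes` bytes of a valid UTF-8
// string without splitting a code point. It knows nothing about
// combining marks, so a base letter can still lose its diacritic.
std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t cut = max_bytes;
    // If the byte at the cut position continues a sequence, the cut would
    // land mid-character; move back to the start of that sequence.
    while (cut > 0 && is_continuation(static_cast<unsigned char>(s[cut])))
        --cut;
    return s.substr(0, cut);
}
```

With s = "caf\xC3\xA9" (the word "café" in UTF-8, 5 bytes), s.substr(0, 4) keeps a dangling C3 byte, while truncate_utf8(s, 4) returns "caf". The second risk is not addressed by this helper: U"e\u0301" is two code points (e plus a combining acute accent) that display as one character, so even a string of 32-bit units can be split between them.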

The support of Unicode in C++ is therefore extremely subpar, as far as the standard is concerned.

If you really wish to manipulate Unicode strings, you need a Unicode-aware container. The usual way is to use the ICU library, though its interface is really C-ish. However, you'll get everything you need to actually work in Unicode with multiple languages.

157

answered Sep 18 '22 09:09

Matthieu M.

Tags: c++, string, character-encoding, unicode, utf-8
