
Cross-platform C++: Use the native string encoding or standardise across platforms?

We are specifically eyeing Windows and Linux development, and have come up with two differing approaches that both seem to have their merits. The natural Unicode string encoding on Windows is UTF-16, and on Linux it is UTF-8.

We can't decide which is the best approach:

  1. Standardise on one of the two in all our application logic (and persistent data), and make the other platforms do the appropriate conversions

  2. Use the natural format of each OS for application logic (and thus for making calls into the OS), and convert only at the point of IPC and persistence.

To me, they both seem about as good as each other.

Jesse Pepper asked Apr 02 '12 09:04



1 Answer

and UTF-8 in Linux.

That's mostly true for modern Linux. In practice, the encoding depends on which API or library is used. Some are hardcoded to use UTF-8, but others read the LC_ALL, LC_CTYPE, or LANG environment variables to detect the encoding to use (the Qt library, for example). So be careful.

We can't decide whether the best approach

As usual, it depends.

If 90% of your code deals with platform-specific APIs in a platform-specific way, it is obviously better to use platform-specific strings. Examples: a device driver or a native iOS application.

If 90% of your code is complex business logic shared across platforms, it is obviously better to use the same encoding on all platforms. Examples: a chat client or a browser.

In the second case you have a choice:

  • Use a cross-platform library that provides string support (Qt or ICU, for example)
  • Use bare pointers (I consider std::string a "bare pointer" too)

If working with strings is a significant part of your application, choosing a good string library is the right move. For example, Qt has a very solid set of classes that covers 99% of common tasks. Unfortunately, I have no ICU experience, but it also looks very nice.

When using a string library, you only need to care about encoding when working with external libraries, platform APIs, or when sending strings over the network (or to disk). For example, many Cocoa, C#, and Qt programmers (all three environments have solid string support) know very little about encoding details, and that is good, since they can focus on their main task.

My experience with strings is a little specific, so I personally prefer bare pointers. Code that uses them is very portable (in the sense that it can easily be reused in other projects and on other platforms) because it has fewer external dependencies. It is also extremely simple and fast (though you probably need some experience and a Unicode background to appreciate that).

I agree that the bare-pointer approach is not for everyone. It is good when:

  • You work with entire strings, and splitting, searching, and comparing are rare tasks
  • You can use the same encoding in all components and need a conversion only when using platform APIs
  • All your supported platforms have APIs to:
    • Convert from your encoding to the one used by the API
    • Convert from the API encoding to the one used in your code
  • Pointers are not a problem for your team

From my (admittedly specific) experience, this is actually a very common case.

When working with bare pointers, it is good to choose one encoding that will be used across the entire project (or across all projects).

From my point of view, UTF-8 is the ultimate winner. If you can't use UTF-8, use a string library or platform APIs for strings; it will save you a lot of time.

Advantages of UTF-8:

  • Fully ASCII compatible. Any ASCII string is a valid UTF-8 string.
  • C std library works great with UTF-8 strings. (*)
  • C++ std library works great with UTF-8 (std::string and friends). (*)
  • Legacy code works great with UTF-8.
  • Virtually every platform supports UTF-8.
  • Debugging is MUCH easier with UTF-8 (since it is ASCII compatible).
  • No Little-Endian/Big-Endian mess.
  • You will not hit the classic bug of assuming UTF-16 is always 2 bytes per character.

(*) Until you need to compare them lexicographically, transform case (toUpper/toLower), change normalization form, or something like that; if you do, use a string library or platform APIs.

The disadvantages are questionable:

  • Less compact for Chinese (and other characters with large code point values) than UTF-16.
  • Slightly harder to iterate over characters.

So, I recommend using UTF-8 as the common encoding for projects that don't use a string library.

But encoding is not the only question you need to answer.

There is such a thing as normalization. To put it simply, some letters can be represented in several ways: as one code point or as a combination of different code points. The common problem is that most string-comparison functions treat these representations as different characters. If you are working on a cross-platform project, standardizing on one of the normalization forms is the right move. It will save you time.

For example, if a user's password contains "йёжиг", it will be represented differently (in both UTF-8 and UTF-16) when entered on a Mac (which mostly uses Normalization Form D) and on Windows (which mostly likes Normalization Form C). So if the user registered under Windows with such a password, they will have a problem logging in on a Mac.

In addition, I would not recommend using wchar_t (or would use it only in Windows code as a UCS-2/UTF-16 char type). The problem with wchar_t is that there is no encoding associated with it. It's just an abstract wide char that is larger than a normal char (16 bits on Windows, 32 bits on most *nix).

wonder.mice answered Oct 13 '22 00:10