I'm working on an English-only C++ program for Windows where we were told "always use std::wstring", but it seems like nobody on the team really understands much beyond that.
I already read the question titled "std::wstring VS std::string". It was very helpful, but I still don't quite understand how to apply all of that information to my problem.
The program I'm working on displays data in a Windows GUI. That data is persisted as XML. We often transform that XML using XSLT into HTML or XSL:FO for reporting purposes.
My feeling based on what I have read is that the HTML should be encoded as UTF-8. I know very little about GUI development, but the little bit I have read indicates that the GUI stuff is all based on UTF-16 encoded strings.
I'm trying to understand where this leaves me. Say we decide that all of our persisted data should be UTF-8 encoded XML. Does this mean that in order to display persisted data in a UI component, I should really be performing some sort of explicit UTF-8 to UTF-16 transcoding process?
I suspect my explanation could use clarification, so I'll try to provide that if you have any questions.
String overview: std::string is used for standard ASCII and UTF-8 strings. std::wstring is used for wide-character strings, which on Windows means UTF-16 because wchar_t is 16 bits wide there. For UTF-32, C++11 added std::u32string (alongside std::u16string); before C++11 there was no built-in class, though you could instantiate basic_string with a 32-bit character type yourself.
UTF-8 actually works quite well in std::string . Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.
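To make the "works out of the box" point concrete, here is a minimal sketch, assuming the input is already valid UTF-8. The helper name count_code_points is hypothetical and not part of any library. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, which is also what makes the encoding self-synchronizing.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string.
// Every byte that is NOT a continuation byte (10xxxxxx) starts a new
// code point. Assumes valid UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "caf\xC3\xA9" (the word "café" as UTF-8 bytes), size() reports 5 because it counts bytes, while count_code_points reports 4. Byte-oriented operations such as find, substr on ASCII delimiters, and concatenation keep working unchanged; only code that indexes "characters" needs this kind of care.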
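std::string doesn't have the concept of encodings. It just stores whatever bytes are passed to it, so what a statement such as cout << 'è'; prints depends on how the source file was encoded and on what encoding the console expects.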
UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”
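The mapping from a code point to its UTF-8 bytes can be sketched as follows. The function name encode_utf8 and its convention of returning an empty string for invalid input are assumptions made for this illustration, not a standard API.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
// Returns an empty string for surrogates (U+D800..U+DFFF) and for
// values above U+10FFFF; that error convention is an assumption.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (identical to ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;  // surrogates are not valid scalar values
        }
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, U+00E8 (è) becomes the two bytes C3 A8, and U+20AC (€) becomes the three bytes E2 82 AC. Code points below U+0080 come out as a single byte equal to their ASCII value, which is the backward compatibility mentioned earlier.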
Windows from NT4 onwards is based on Unicode-encoded strings, yes. Early versions were based on UCS-2, which is the predecessor of UTF-16 and cannot represent characters outside the Basic Multilingual Plane, so it does not support all of the characters that UTF-16 does. Later versions are based on UTF-16. Not all OSes are based on UTF-16/UCS-2, though. *nix systems, for instance, typically use UTF-8 instead.
UTF-8 is a very good choice for storing data persistently. It is supported in practically every Unicode-aware environment, it can represent every Unicode character losslessly, and it is compact for mostly ASCII data such as English text and XML markup.
Yes. You would have to parse the XML, extract the necessary information from it, and convert that text from UTF-8 to UTF-16 before handing it to the UI.
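The transcoding step itself is small. On Windows the usual route is the Win32 function MultiByteToWideChar with the CP_UTF8 code page; the portable sketch below shows what such a conversion does under the hood. The function name utf8_to_utf16 is hypothetical, and replacing malformed input with U+FFFD is an assumption of this sketch. It does not reject overlong forms, so treat it as an illustration rather than a strict validator.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: transcodes UTF-8 bytes to UTF-16 code units.
// Malformed or truncated sequences become U+FFFD (an assumption of this
// sketch). Overlong encodings are not detected.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else {
            // Stray continuation byte or invalid lead byte.
            out += static_cast<char16_t>(0xFFFD);
            ++i;
            continue;
        }
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size()) { ok = false; break; }
            const unsigned char next = static_cast<unsigned char>(in[i]);
            if ((next & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (next & 0x3F);
            ++i;
        }
        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out += static_cast<char16_t>(0xFFFD);
        } else if (cp < 0x10000) {
            // Basic Multilingual Plane: one UTF-16 code unit.
            out += static_cast<char16_t>(cp);
        } else {
            // Supplementary plane: encode as a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

Because wchar_t is 16 bits on Windows, the result can be copied into a std::wstring with std::wstring w(u16.begin(), u16.end()) and passed to the wide-character GUI functions. For an English-only program nearly every character is ASCII, so the conversion mostly widens each byte to one 16-bit unit, but doing it explicitly keeps the occasional accented name or symbol intact.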