
Exceptions with Unicode what()

Tags: c++, c++11

Or, "how do Russians throw exceptions?"

The definition of std::exception is:

namespace std {
  class exception {
  public:
    exception() throw();
    exception(const exception&) throw();
    exception& operator=(const exception&) throw();
    virtual ~exception() throw();
    virtual const char* what() const throw();
  };
}

A popular school of thought for designing exception hierarchies is to derive from std::exception:

Generally, it's best to throw objects, not built-ins. If possible, you should throw instances of classes that derive (ultimately) from the std::exception class. By making your exception class inherit (ultimately) from the standard exception base-class, you are making life easier for your users (they have the option of catching most things via std::exception), plus you are probably providing them with more information (such as the fact that your particular exception might be a refinement of std::runtime_error or whatever).

But in the face of Unicode, it seems to be impossible to design an exception hierarchy that achieves both of the following:

  • Derives ultimately from std::exception for ease of use at the catch site
  • Provides Unicode compatibility so that diagnostics are not sliced or gibberish

Coming up with an exception class that can be constructed with Unicode strings is simple enough. But the standard dictates that what() must return a const char*, so at some point the input strings must be converted to ASCII. Whether that is done at construction time or when what() is called, if the source string uses characters not representable in 7-bit ASCII, it might be impossible to format the message without loss of fidelity.

How do you design an exception hierarchy that combines the seamless integration of a std::exception-derived class with lossless Unicode diagnostics?

John Dibling asked Sep 21 '10 13:09


3 Answers

char* does not mean ASCII. You could use a Unicode encoding with 8-bit code units, such as UTF-8. A char could also be 16 bits or wider on some platforms, in which case you could use UTF-16.
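As a minimal sketch of that point: what() only hands back bytes, so nothing stops those bytes from being UTF-8. The helper names throw_utf8_error and caught_message are hypothetical, and the message is written with hex escapes so the example does not depend on the source file's encoding.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical helper: throws an error whose message is UTF-8 encoded.
// The escapes spell the Russian word for "error" (six Cyrillic letters,
// two bytes each in UTF-8).
inline void throw_utf8_error() {
    throw std::runtime_error("\xD0\xBE\xD1\x88\xD0\xB8\xD0\xB1\xD0\xBA\xD0\xB0");
}

// Catches through the std::exception base and returns the bytes of what().
inline std::string caught_message() {
    try {
        throw_utf8_error();
    } catch (const std::exception& e) {
        return e.what();  // the UTF-8 bytes pass through unchanged
    }
    return std::string();  // not reached; keeps compilers quiet
}
```

Whether the caller displays those bytes correctly still depends on the caller knowing they are UTF-8.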

TheFogger answered Oct 23 '22 11:10


Returning UTF-8 is an obvious choice. If the application that uses your exceptions uses a different multibyte encoding, though, it might have a hard time displaying the string. (It can't know it's UTF-8, can it?) On the other hand, with the ISO-8859-* 8-bit encodings (Western European, Cyrillic, etc.), displaying a UTF-8 string will "just" show some gibberish, and you (or your user) might be fine with that if you cannot distinguish between a char* in the locale character set and one in UTF-8.

Personally, I think only low-level error messages should go into what() strings, and that these should be in English anyway (maybe combined with an error number or the like).

The worst problem I see with what() is that it is not uncommon to include some contextual details in the message, for example a filename. Filenames quite often contain non-ASCII characters, so you are left with no choice but to use UTF-8 as the what() encoding.

Note also that your exception class (the one derived from std::exception) can provide any accessor methods you like, so it might make sense to add an explicit what_utf8(), what_utf16(), or what_iso8859_5().
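A rough sketch of such an extra accessor, under assumptions: the class name FileError, its two-argument constructor, and the describe helper are all hypothetical. Here what() stays plain ASCII while what_utf8() carries the full diagnostic, so generic catch sites keep working and encoding-aware ones can ask for more.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical exception: what() is an ASCII summary, what_utf8() adds detail.
class FileError : public std::runtime_error {
public:
    FileError(const std::string& ascii_summary, const std::string& utf8_detail)
        : std::runtime_error(ascii_summary), utf8_detail_(utf8_detail) {}

    // The method name documents the encoding of the returned bytes.
    const std::string& what_utf8() const noexcept { return utf8_detail_; }

private:
    std::string utf8_detail_;
};

// Generic catch sites see only what(); aware catch sites can ask for more.
inline std::string describe(const std::exception& e) {
    if (const FileError* fe = dynamic_cast<const FileError*>(&e)) {
        return fe->what_utf8();
    }
    return e.what();
}
```

The design choice here is that the encoding is part of the accessor's contract rather than something the caller has to guess from what().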

Edit: Regarding John's comment on how to return UTF-8:

If you have a const char* what() function, it essentially returns a bunch of bytes. On a Western European Windows platform, these bytes would usually be encoded as Windows-1252, but on a Russian Windows they might just as well be Windows-1251.

What the returned bytes signify depends on their encoding, and their encoding depends on where they "came from" (and on who is interpreting them). A string literal's encoding is fixed at compile time, but at runtime it is still up to the application how to interpret the bytes.

So, to have your exception return UTF-8 strings with what() (or what_utf8()) you have to make sure that:

  • The input message to your exception has a well-defined encoding.
  • You have a well-defined encoding for the string member you use to hold the message.
  • You appropriately convert the encoding when what() is called.
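One way to meet the three requirements above, as a sketch only: the names Utf8Error and utf16_to_utf8 are hypothetical, the input encoding is assumed to be UTF-16, and the conversion is done once in the constructor instead of on every what() call. Unpaired surrogates are replaced with U+FFFD, which is a choice made for this example.

```cpp
#include <cstddef>
#include <exception>
#include <string>

// Hypothetical helper: encodes UTF-16 as UTF-8.
inline std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // lone surrogate becomes the replacement character
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

class Utf8Error : public std::exception {
public:
    // 1. The input encoding is fixed by the parameter type: UTF-16.
    // 2. The stored encoding is fixed by convention: UTF-8.
    explicit Utf8Error(const std::u16string& msg) : utf8_(utf16_to_utf8(msg)) {}

    // 3. Nothing left to convert here; it happened in the constructor.
    const char* what() const noexcept override { return utf8_.c_str(); }

private:
    std::string utf8_;
};
```

Converting up front keeps what() trivial and non-throwing, at the cost of doing the work even when nobody reads the message.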

Example:

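#include <exception>
#include <string>

// Assumed to be defined elsewhere: converts ISO-8859-1 bytes to UTF-8.
std::string convert_iso8859_1_to_utf8(const std::string& s);

struct MyExc : virtual public std::exception {
  // std::exception has no portable constructor taking a message,
  // so the message is stored here and returned from what().
  explicit MyExc(const char* msg) : msg_(msg) { }
  const char* what() const noexcept override { return msg_.c_str(); }
  std::string what_utf8() const {  // const, so it works on a const reference
    return convert_iso8859_1_to_utf8(what());
  }
  std::string msg_;  // holds the ISO-8859-1 bytes
};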

// In an ISO-8859-1 encoded source file
const char* my_err_msg = "ISO-8859-1 ... äöüß ...";
...
throw MyExc(my_err_msg);
...
catch(MyExc const& e) {
  std::string iso8859_1_msg = e.what();
  std::string utf_msg = e.what_utf8();
...

The conversion could also be placed in the overridden what() member function of MyExc. Alternatively, you could define the exception to take an already UTF-8 encoded string, or you could convert in the constructor from an expected input encoding (maybe wchar_t/UTF-16).
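The example above calls convert_iso8859_1_to_utf8 without defining it. One possible implementation is sketched below; the signature is an assumption. It relies on the fact that ISO-8859-1 byte values coincide with the Unicode code points U+0000 to U+00FF, so every input byte becomes one or two UTF-8 bytes.

```cpp
#include <string>

// Possible implementation of the helper named in the example (assumed signature).
std::string convert_iso8859_1_to_utf8(const std::string& latin1) {
    std::string utf8;
    utf8.reserve(latin1.size() * 2);  // worst case: every byte doubles
    for (std::string::size_type i = 0; i < latin1.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(latin1[i]);
        if (c < 0x80) {
            utf8 += static_cast<char>(c);                  // ASCII: copied as-is
        } else {
            utf8 += static_cast<char>(0xC0 | (c >> 6));    // lead byte
            utf8 += static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
        }
    }
    return utf8;
}
```

This only works for ISO-8859-1; other 8-bit code pages such as ISO-8859-5 or Windows-1251 need a lookup table instead.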

Martin Ba answered Oct 23 '22 11:10


The first question is: what do you intend to do with the what() string?

Do you plan to log the information somewhere?

If so, you should not use the content of the what() string directly; use it as a key to look up the correct locale-specific logging message. To me, the content of what() is not meant for logging (or any form of display); it is a means of looking up the actual logging string, which can be any Unicode string.
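A small sketch of that lookup idea, with everything in it assumed for illustration: the key "file.not_found", the russian_catalog table, and the log_text helper are hypothetical. what() carries only a stable ASCII key, and the displayed text comes from a catalog of UTF-8 strings.

```cpp
#include <exception>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical catalog mapping stable ASCII keys to UTF-8 messages.
// A real application would load one such table per locale.
inline const std::map<std::string, std::string>& russian_catalog() {
    static const std::map<std::string, std::string> catalog = {
        // "File not found" in Russian, spelled out as UTF-8 bytes.
        {"file.not_found",
         "\xD0\xA4\xD0\xB0\xD0\xB9\xD0\xBB"
         " \xD0\xBD\xD0\xB5"
         " \xD0\xBD\xD0\xB0\xD0\xB9\xD0\xB4\xD0\xB5\xD0\xBD"},
    };
    return catalog;
}

// Returns the catalog entry when the key is known, otherwise falls back
// to the raw what() string.
inline std::string log_text(const std::exception& e) {
    const auto& catalog = russian_catalog();
    const auto it = catalog.find(e.what());
    return it != catalog.end() ? it->second : std::string(e.what());
}
```

The fallback matters in practice, because exceptions from third-party code will not use keys from your catalog.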

Now, it can be useful for the what() string to contain a human-readable message that helps developers with quick debugging, but highly polished text is not required for that. As a result, there is no reason to support anything more than ASCII. Obey the KISS principle.

Martin York answered Oct 23 '22 11:10