How can I find out what the current charset is in C++?

In a console application (WinXP) I am getting negative values for some characters (like äöüé) with

(int)mystring[a]

and this surprises me. I was expecting the values to be between 128 and 255.

So is there something like GetCharset() or SetCharset() in C++?

asked Feb 27 '23 by Stef

2 Answers

It depends on how you look at the value you have at hand. Plain char can be signed (as it is on Windows) or unsigned (as on some other systems); which one you get is implementation-defined. So, to get the values you are expecting, convert the char to unsigned char before printing it.
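
A minimal sketch of that conversion (the string literal is a stand-in for the asker's data; the exact byte values depend on the source encoding):

    #include <iostream>
    #include <string>

    int main() {
        std::string mystring = "\xE4\xF6\xFC\xE9"; // "äöüé" in Latin-1: bytes above 127

        for (std::size_t a = 0; a < mystring.size(); ++a) {
            // Convert through unsigned char so the value is never negative,
            // regardless of whether plain char is signed on this platform.
            int value = static_cast<unsigned char>(mystring[a]);
            std::cout << value << '\n'; // prints 228 246 252 233
        }
    }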

C++ itself is, so far, character-set agnostic. For the Windows console specifically, you can query the output code page with GetConsoleOutputCP.
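
For illustration, a minimal Windows-only sketch that queries the console's output code page:

    #include <windows.h>
    #include <iostream>

    int main() {
        // Returns the code page the console uses for output,
        // e.g. 850 (Western European DOS) or 1252 (Windows Latin-1).
        UINT cp = GetConsoleOutputCP();
        std::cout << "Console output code page: " << cp << '\n';
    }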

answered Mar 07 '23 by Khaled Alshaya


Look at std::numeric_limits<char>::min() and max(). Or CHAR_MIN and CHAR_MAX if you don't like typing, or if you need an integer constant expression.

If CHAR_MAX == UCHAR_MAX and CHAR_MIN == 0 then chars are unsigned (as you expected). If CHAR_MAX != UCHAR_MAX and CHAR_MIN < 0 they are signed (as you're seeing).
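
A quick sketch of that check:

    #include <climits>
    #include <iostream>
    #include <limits>

    int main() {
        // Unary + promotes char to int so the limits print as numbers,
        // not as characters.
        std::cout << "char range: "
                  << +std::numeric_limits<char>::min() << " to "
                  << +std::numeric_limits<char>::max() << '\n';

        if (CHAR_MIN < 0)
            std::cout << "plain char is signed here\n";   // e.g. -128 to 127
        else
            std::cout << "plain char is unsigned here\n"; // e.g. 0 to 255
    }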

The standard ensures, in 3.9.1/1, that there are no other possibilities: "... a plain char can take on either the same values as a signed char or an unsigned char; which one is implementation-defined."

This tells you whether char is signed or unsigned, and that's what's confusing you. You certainly can't call anything to modify it: from the POV of a program it's baked into the compiler even if the compiler has ways of changing it (GCC certainly does: -fsigned-char and -funsigned-char).

The usual way to deal with this, if you're going to cast a char to int, is to cast it through unsigned char first. So in your example, (int)(unsigned char)mystring[a]. This ensures you get a non-negative value.
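
For instance, assuming plain char is signed (as on the asker's Windows system):

    #include <iostream>
    #include <string>

    int main() {
        std::string mystring = "\xE4"; // 'ä' in Latin-1, a byte above 127

        // With a signed char this prints a negative value, here -28.
        std::cout << (int)mystring[0] << '\n';

        // Casting through unsigned char first always yields 0..255, here 228.
        std::cout << (int)(unsigned char)mystring[0] << '\n';
    }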

It doesn't actually tell you what charset your implementation uses for char, but I don't think you need to know that. On Microsoft compilers, the answer is essentially that commonly-used character encoding "ISO-8859-mutter-mutter". This means that chars with 7-bit ASCII values are represented by that value, while values outside that range are ambiguous, and will be interpreted by a console or other recipient according to how that recipient is configured. ISO Latin 1 unless told otherwise.

Properly speaking, the way characters are interpreted is locale-specific, and the locale can be modified and interrogated using a whole bunch of stuff towards the end of the C++ standard that personally I've never gone through and can't advise on ;-)
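
For what it's worth, a minimal sketch of interrogating the locale (the names reported are platform-dependent, and constructing the "" locale can throw if the environment is misconfigured):

    #include <clocale>
    #include <iostream>
    #include <locale>

    int main() {
        // Passing a null pointer queries the current C locale without changing it.
        std::cout << "C locale: " << std::setlocale(LC_ALL, nullptr) << '\n';

        // The empty name selects the user's environment default; its name
        // often encodes the charset, e.g. "en_US.UTF-8" or "German_Germany.1252".
        std::locale loc("");
        std::cout << "environment locale: " << loc.name() << '\n';
    }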

Note that if there's a mismatch between the charset in effect, and the charset your console uses, then you could be in for trouble. But I think that's separate from your issue: whether chars can be negative or not is nothing to do with charsets, just whether char is signed.

answered Mar 07 '23 by Steve Jessop