 

Why are there no "unsigned wchar_t" and "signed wchar_t" types?

The signedness of char is not standardized. Hence there are signed char and unsigned char types. Therefore functions which work with a single character must use an argument type which can hold both signed char and unsigned char (this type was chosen to be int), because if the argument type were char, we would get type conversion warnings from the compiler (if -Wconversion is used) in code like this:

char c = 'ÿ';
if (islower((unsigned char) c)) ...

warning: conversion to ‘char’ from ‘unsigned char’ may change the sign of the result

(here we consider what would happen if the argument type of islower() was char)

And the thing which makes it work without explicit typecasting is automatic promotion from char to int.

Further, the ISO C90 standard, where wchar_t was introduced, does not say anything specific about the representation of wchar_t.

Some quotations from glibc reference:

it would be legitimate to define wchar_t as char

if wchar_t is defined as char the type wint_t must be defined as int due to the parameter promotion.

So, wchar_t can perfectly well be defined as char, which means that similar rules for wide character types must apply, i.e., there may be implementations where wchar_t is unsigned, and there may be implementations where wchar_t is signed. From this it follows that there must exist unsigned wchar_t and signed wchar_t types (for the same reason as there are unsigned char and signed char types).

Private communication reveals that an implementation is allowed to support wide characters with >=0 value only (independently of signedness of wchar_t). Does anybody know what this means? Does this mean that when wchar_t is a 16-bit type (for example), we can only use 15 bits to store the value of a wide character? In other words, is it true that a sign-extended wchar_t is a valid value? See also this question.

Also, private communication reveals that the standard requires that any valid value of wchar_t must be representable by wint_t. Is it true?

Consider this example:

#include <locale.h>
#include <ctype.h>
int main (void)
{
  setlocale(LC_CTYPE, "fr_FR.ISO-8859-1");

  /* 11111111 */
  char c = 'ÿ';

  if (islower(c)) return 0;
  return 1;
}

To make it portable, we need the cast to (unsigned char). This is necessary because char may be equivalent to signed char, in which case a byte with the top bit set would be sign-extended when converted to int, yielding a value that is outside the range of unsigned char.

Now, why is this scenario different from the following example for wide characters?

#include <locale.h>
#include <wchar.h>
#include <wctype.h>
int main(void)
{
  setlocale(LC_CTYPE, "");
  wchar_t wc = L'ÿ';

  if (iswlower(wc)) return 0;
  return 1;
}

We need to use iswlower((unsigned wchar_t)wc) here, but there is no unsigned wchar_t type.

Why are there no unsigned wchar_t and signed wchar_t types?

UPDATE

Are the standards saying that casting to unsigned int and to int in the following two programs is guaranteed to be correct? (I just replaced wint_t and wchar_t with their actual definitions in glibc)

#include <locale.h>
#include <wchar.h>
int main(void)
{
  setlocale(LC_CTYPE, "en_US.UTF-8");
  unsigned int wc;
  wc = getwchar();
  putwchar((int) wc);
}

--

#include <locale.h>
#include <wchar.h>
#include <wctype.h>
int main(void)
{
  setlocale(LC_CTYPE, "en_US.UTF-8");
  int wc;
  wc = L'ÿ';
  if (iswlower((unsigned int) wc)) return 0;
  return 1;
}
Igor Liferenko asked Nov 23 '16 03:11



1 Answer

TL;DR:

Why are there no unsigned wchar_t and signed wchar_t types?

Because C's wide-character handling facilities were defined such that they are not needed.


In more detail,

The signedness of char is not standardized.

To be precise, "The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char." (C2011, 6.2.5/15)
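
As a minimal sketch of that point, a program can discover which choice its implementation made by inspecting CHAR_MIN from <limits.h>. The helper name plain_char_is_signed is made up for illustration.

```c
#include <limits.h>

/* Hypothetical helper: returns nonzero when plain char has the same
   range as signed char on this implementation. */
int plain_char_is_signed(void)
{
    /* CHAR_MIN is 0 when char behaves like unsigned char,
       and equals SCHAR_MIN (a negative value) otherwise. */
    return CHAR_MIN < 0;
}
```

Either answer is conforming, so portable code should not depend on the result.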

Hence there are signed char and unsigned char types.

"Hence" implies causation, which would be hard to argue clearly, but certainly signed char and unsigned char are more appropriate when you want to handle numbers, as opposed to characters.

Therefore functions which work with a single character must use an argument type which can hold both signed char and unsigned char

No, not at all. Standard library functions that work with individual characters could easily be defined in terms of type char, regardless of whether that type is signed, because the library implementation does know its signedness. If that were a problem then it would apply equally to the string functions, too -- char would be useless.

Your example of getchar() is non-apposite. It returns int rather than a character type because it needs to be able to return an error indicator that does not correspond to any character. Moreover, the code you present does not correspond to the accompanying warning message: it contains a conversion from int to unsigned char, but no conversion from char to unsigned char.

Some other character-handling functions accept int parameters or return values of type int both for compatibility with getchar() and other stdio functions, and for historic reasons. In days of yore, you couldn't actually pass a char at all -- it would always be promoted to int, and that is what the functions would (and must) accept. One cannot later change the argument type, evolution of the language notwithstanding.
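
To illustrate that compatibility, here is a small sketch of the classic read loop. The function count_lower is a hypothetical helper, and it uses fgetc() on an arbitrary stream rather than getchar() only so that it can be exercised on any FILE; the return-value contract is the same.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: count the lowercase characters readable from `in`.
   fgetc(), like getchar(), returns either EOF or the character converted
   to unsigned char, which is exactly the set of values islower() accepts,
   so no cast is needed here. */
size_t count_lower(FILE *in)
{
    size_t n = 0;
    int ch;  /* int, not char: it must be able to hold EOF as well */

    while ((ch = fgetc(in)) != EOF) {
        if (islower(ch))
            n++;
    }
    return n;
}
```

The cast to unsigned char only becomes necessary when the value comes from a char object, such as an element of a string, instead of from one of the stdio input functions.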

Further, the ISO C90 standard, where wchar_t was introduced, does not say anything specific about the representation of wchar_t.

C90 isn't really relevant any longer, but no doubt it says something very similar to C2011 (7.19/2), which describes wchar_t as

an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales [...].

Your quotations from the glibc reference are non-authoritative, except possibly for glibc only. They appear in any case to be commentary, not specification, and it's unclear why you raise them. Certainly, though, at least the first is correct. Referring to the standard, if all the members of the largest extended character set specified among the locales supported by a given implementation could fit in a char then that implementation could define wchar_t as char. Such implementations used to be much more common than they are today.

You ask several questions:

Private communication reveals that an implementation is allowed to support wide characters with >=0 value only (independently of signedness of wchar_t). Does anybody know what this means?

I think it means that whoever communicated that to you doesn't know what they are talking about, or perhaps that what they are talking about is something different than the requirements placed by the C standard. You will find that in practice, character sets are defined with only non-negative character codes, but that is not a constraint placed by the C standard.

Does this mean that when wchar_t is a 16-bit type (for example), we can only use 15 bits to store the value of a wide character?

The C standard does not say or imply that. You can store the value of any supported character in a wchar_t. In particular, if an implementation supports a character set containing character codes exceeding 32767, then you can store those in a wchar_t.

In other words, is it true that a sign-extended wchar_t is a valid value?

The C standard does not say or imply that. It does not even say whether wchar_t is a signed type (if not, then sign extension is meaningless for it). If it is a signed type, then there is no guarantee about whether sign-extending a value representing a character in some supported character set (which value could, in principle, be negative) will produce a value that also represents a character in that character set, or in any other supported character set. The same is true of adding 1 to a wchar_t value.
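
Because the standard leaves the signedness open, the only way to know is to ask the implementation. A minimal sketch, using the WCHAR_MIN and WINT_MIN macros from <stdint.h> (the two function names are invented for this example):

```c
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper: nonzero when wchar_t is a signed type here. */
int wchar_is_signed(void)
{
    /* WCHAR_MIN is 0 for an unsigned wchar_t, negative for a signed one. */
    return WCHAR_MIN < 0;
}

/* Hypothetical helper: nonzero when wint_t is a signed type here. */
int wint_is_signed(void)
{
    /* WINT_MIN follows the same convention for wint_t. */
    return WINT_MIN < 0;
}
```

The two results need not agree: the question's update describes an implementation where wchar_t is int and wint_t is unsigned int.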

Also, private communication reveals that the standard requires that any valid value of wchar_t must be representable by wint_t. Is it true?

It depends what you mean by "valid". The standard says that wint_t

is an integer type unchanged by default argument promotions that can hold any value corresponding to members of the extended character set, as well as at least one value that does not correspond to any member of the extended character set.

(C2011, 7.29.1/2)

wchar_t must be able to hold any value corresponding to a member of the extended character set, in any supported locale. wint_t must be able to hold all of those values, too. It may be, however, that wchar_t is capable of representing values that do not correspond to any character in any supported character set. Such values are valid in the sense that the type can represent them. wint_t is not required to be able to represent such values.

For example, if the largest extended character set of any supported locale uses character codes up to but not exceeding 32767, then an implementation would be free to implement wchar_t as an unsigned 16-bit integer, and wint_t as a signed 16-bit integer. The values representable by wchar_t that do not correspond to extended characters are then not representable by wint_t (but wint_t still has many candidates for its required value that does not correspond to any character).
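
Whether a given implementation is of that kind can be checked from the limit macros. The following is only a sketch: wchar_range_fits_in_wint is a hypothetical helper, and it compares whole type ranges, which says nothing about which values actually correspond to characters.

```c
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper: nonzero when every value of type wchar_t, whether
   or not it encodes a character, is also representable in wint_t. */
int wchar_range_fits_in_wint(void)
{
    /* Lower bound: a non-negative WCHAR_MIN (that is, 0) always fits,
       because WINT_MIN is either 0 or negative. Otherwise wint_t must be
       signed and reach at least as low. Thanks to short-circuiting, the
       comparison in intmax_t happens only when both minima are negative. */
    int low_ok = !(WCHAR_MIN < 0)
              || (WINT_MIN < 0 && (intmax_t)WINT_MIN <= (intmax_t)WCHAR_MIN);

    /* Upper bound: both maxima are positive, so compare them as unsigned. */
    int high_ok = (uintmax_t)WCHAR_MAX <= (uintmax_t)WINT_MAX;

    return low_ok && high_ok;
}
```

A result of 0 does not indicate a defect. It only means that some wchar_t values exist that wint_t cannot hold, and on a conforming implementation none of those values corresponds to a member of the extended character set.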

With respect to the character and wide-character classification functions, the only answer is that the differences simply arise from different specifications. The char classification functions are defined to work with the same values that getchar() is defined to return -- either EOF or a character value converted, if necessary, to unsigned char. The wide-character classification functions, on the other hand, accept arguments of type wint_t, which can represent the values of all wide characters unchanged, so there is no need for a conversion.

You claim in this regard that

We need to use iswlower((unsigned wchar_t)wc) here, but there is no unsigned wchar_t type.

No and maybe. You do not need to convert the wchar_t argument to iswlower() to any other type, and in particular, you do not need to convert it to an explicitly unsigned type. The wide character classification functions are not analogous to the regular character classification functions in this respect, having been designed with the benefit of hindsight. As for unsigned wchar_t, C does not require such a type to exist, so portable code should not use it, but it may exist in some implementations.
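
The contrast can be put side by side in a short sketch. The wrapper names is_lower_byte and is_lower_wide are invented for illustration; the library calls are the standard islower() and iswlower().

```c
#include <ctype.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical wrapper around islower() for a plain char argument. */
int is_lower_byte(char c)
{
    /* The cast is required: if plain char is signed, c may be negative,
       and islower() accepts only EOF or unsigned char values. */
    return islower((unsigned char)c) != 0;
}

/* Hypothetical wrapper around iswlower() for a wchar_t argument. */
int is_lower_wide(wchar_t wc)
{
    /* No cast: wc converts implicitly to wint_t, and the value of any
       member of the extended character set survives that conversion. */
    return iswlower(wc) != 0;
}
```

Both wrappers classify according to the current LC_CTYPE locale, so results for non-ASCII characters depend on the setlocale() call made by the program.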


Regarding the update appended to the question:

Are the standards saying that casting to unsigned int and to int in the following two programs is guaranteed to be correct? (I just replaced wint_t and wchar_t with their actual definitions in glibc)

The standard says nothing of the sort about conforming implementations in general. I'll suppose, however, that you mean to ask specifically about conforming implementations for which wchar_t is int and wint_t is unsigned int.

On such an implementation, your first program is flawed because it does not account for the possibility that getwchar() returns WEOF. Converting WEOF to type wchar_t, if doing so does not cause a signal to be raised, is not guaranteed to produce a value that corresponds to any wide character. Passing the result of such a conversion to putwchar() therefore does not exhibit defined behavior. Moreover, if WEOF is defined with the same value as UINT_MAX (which is not representable by int) then the conversion of that value to int has implementation-defined behavior independently of the putwchar() call.

On the other hand, I think the key point you are struggling with is that if the value returned by getwchar() in the first program is not WEOF, then it is guaranteed to be one that is unchanged by conversion to wchar_t. Your first program will perform as appears to be intended in that case, but the cast to int (or wchar_t) is unnecessary.

Similarly, the second program is correct provided that the wide-character literal corresponds to a character in the applicable extended character set, but the cast is unnecessary and changes nothing. The wchar_t value of such a literal is guaranteed to be representable by type wint_t, so the cast changes the type of its operand, but not the value. (But if the literal does not correspond to a character in the extended character set then the behavior is implementation-defined.)

On the third hand, if your objective is to write strictly-conforming code then the right thing to do, and indeed the intended usage mode of these particular wide-character functions, would be this:

#include <locale.h>
#include <wchar.h>
int main(void)
{
  setlocale(LC_CTYPE, "en_US.UTF-8");
  wint_t wc = getwchar();
  if (wc != WEOF) {
    // No cast is necessary or desirable
    putwchar(wc);
  }
}

and this:

#include <locale.h>
#include <wchar.h>
#include <wctype.h>
int main(void)
{
  setlocale(LC_CTYPE, "en_US.UTF-8");
  wchar_t wc = L'ÿ';
  // No cast is necessary or desirable
  if (iswlower(wc)) return 0;
  return 1;
}
John Bollinger answered Nov 15 '22 04:11
