
How can I check if char encoding is ASCII?

I would like to write the following function:

int char_index(char c) 
{
  if (is_ascii<char>)
    return c - 'A';
  else 
    return c == 'A' ? 0 :
           c == 'B' ? 1 :
           // ...
}

Is there a function like is_ascii in std? I'm imagining something like std::numeric_limits<T>::is_iec559, which says whether some floating-point type T satisfies the requirements of the IEEE 754 standard.


I think I can implement is_ascii myself with something like if (65 == 'A' && ...) that enumerates the entire ASCII charset, and compares them to the int representation, but that's annoying. Also, I'm not sure how to check non-printable characters like SOH (Start Of Heading), etc.

Is it even possible to write this function in user code, or do I have to rely on the implementation to provide such a function?

cigien asked Sep 20 '20 15:09


1 Answer

I assume you want to check whether your compiler uses ASCII encoding when it translates the string literals and character literals in your source code to machine code.

Is there a function like is_ascii in std?

Not that I know of.

I can implement is_ascii myself with something like if (65 == 'A' && ...) that enumerates the entire ASCII charset

So do that. Check the characters that can be a c-char, that is, everything in the basic source character set:

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

and escape sequences:

\a  \b  \f  \n  \r  \t  \v

There's no way to check the "entire" ASCII charset, because the compiler doesn't translate the program into the entire ASCII charset. It only maps the characters of the basic character set and the escape sequences to their machine representation, not the whole charset (though there may be compiler extensions).

but that's annoying.

But that's the only way. To verify that your implementation uses a particular character set, you have to check every character it uses. So check them. The check is evaluated at compile time anyway (constexpr or consteval).

how to check non-printable characters like SOH (Start Of Heading), etc.

Don't. The SOH character can't appear inside a character literal, so you don't have to check it, because it is not possible to express it as a character in the language. There is no \SOH escape sequence, and the 0x01 byte is not in the basic character set. Your compiler never translates a sequence of characters to the SOH character. A valid program is composed only of characters from the basic source character set. The interpretation of the SOH character is up to whatever receives it, and if I write '\001' it is a byte equal to 1 regardless of the encoding.
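To illustrate the last point, here is a small sketch: numeric escape sequences name a value directly, so these assertions should hold no matter which execution character set the compiler uses. The helper is_soh is a hypothetical name, and testing for SOH by value only makes sense when the data you are examining is known to be ASCII.

```cpp
// Numeric escape sequences specify the value itself, so they do not
// depend on the execution character set.
static_assert('\001' == 1, "octal escape yields the value 1");
static_assert('\x01' == 1, "hex escape yields the value 1");

// Hypothetical helper: compares against the SOH code by value. This is
// an assumption about the incoming data (ASCII), not about the compiler.
constexpr bool is_soh(char c) { return c == '\x01'; }
```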


Meh, let's write it! The following program:

#include <type_traits>
#include <algorithm>
constexpr bool compiler_uses_ascii() {
    return 
        '\a'==0x07  &&  '\b'==0x08  &&  '\t'==0x09  &&  '\n'==0x0a  &&  '\v'==0x0b  &&  '\f'==0x0c  &&
        '\r'==0x0d  &&  ' '==0x20   &&  '!'==0x21   &&  '"'==0x22   &&  '#'==0x23   &&  '%'==0x25   &&
        '&'==0x26   &&  '\''==0x27  &&
        '('==0x28   &&  ')'==0x29   &&  '*'==0x2a   &&  '+'==0x2b   &&  ','==0x2c   &&  '-'==0x2d   &&
        '.'==0x2e   &&  '/'==0x2f   &&  '0'==0x30   &&  '1'==0x31   &&  '2'==0x32   &&  '3'==0x33   &&
        '4'==0x34   &&  '5'==0x35   &&  '6'==0x36   &&  '7'==0x37   &&  '8'==0x38   &&  '9'==0x39   &&
        ':'==0x3a   &&  ';'==0x3b   &&  '<'==0x3c   &&  '='==0x3d   &&  '>'==0x3e   &&  '?'==0x3f   &&
        'A'==0x41   &&  'B'==0x42   &&  'C'==0x43   &&  'D'==0x44   &&  'E'==0x45   &&  'F'==0x46   &&
        'G'==0x47   &&  'H'==0x48   &&  'I'==0x49   &&  'J'==0x4a   &&  'K'==0x4b   &&  'L'==0x4c   &&
        'M'==0x4d   &&  'N'==0x4e   &&  'O'==0x4f   &&  'P'==0x50   &&  'Q'==0x51   &&  'R'==0x52   &&
        'S'==0x53   &&  'T'==0x54   &&  'U'==0x55   &&  'V'==0x56   &&  'W'==0x57   &&  'X'==0x58   &&
        'Y'==0x59   &&  'Z'==0x5a   &&  '['==0x5b   &&  '\\'==0x5c  &&  ']'==0x5d   &&  '^'==0x5e   &&
        '_'==0x5f   &&  'a'==0x61   &&  'b'==0x62   &&  'c'==0x63   &&  'd'==0x64   &&  'e'==0x65   &&
        'f'==0x66   &&  'g'==0x67   &&  'h'==0x68   &&  'i'==0x69   &&  'j'==0x6a   &&  'k'==0x6b   &&
        'l'==0x6c   &&  'm'==0x6d   &&  'n'==0x6e   &&  'o'==0x6f   &&  'p'==0x70   &&  'q'==0x71   &&
        'r'==0x72   &&  's'==0x73   &&  't'==0x74   &&  'u'==0x75   &&  'v'==0x76   &&  'w'==0x77   &&
        'x'==0x78   &&  'y'==0x79   &&  'z'==0x7a   &&  '{'==0x7b   &&  '|'==0x7c   &&  '}'==0x7d   &&
        '~'==0x7e;
}
constexpr int char_index(char c)
{
    if constexpr (compiler_uses_ascii()) {
        return c - 'A';
    } else {
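        // Portable fallback: look the letter up in an explicit alphabet.
        // Returns -1 for characters that are not uppercase letters.
        constexpr char a[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
        const char* const end = a + sizeof(a) - 1; // exclude the terminating '\0'
        const char* const it = std::find(a, end, c);
        return it == end ? -1 : static_cast<int>(it - a);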
#if 0
        return
            c == 'A' ? 0 :  c == 'B' ? 1 :  c == 'C' ? 2 :  c == 'D' ? 3 :
            c == 'E' ? 4 :  c == 'F' ? 5 :  c == 'G' ? 6 :  c == 'H' ? 7 :
            c == 'I' ? 8 :  c == 'J' ? 9 :  c == 'K' ? 10 : c == 'L' ? 11 :
            c == 'M' ? 12 : c == 'N' ? 13 : c == 'O' ? 14 : c == 'P' ? 15 :
            c == 'Q' ? 16 : c == 'R' ? 17 : c == 'S' ? 18 : c == 'T' ? 19 :
            c == 'U' ? 20 : c == 'V' ? 21 : c == 'W' ? 22 : c == 'X' ? 23 :
            c == 'Y' ? 24 : c == 'Z' ? 25 : -1;
#endif
    }
}
#include <iostream>
int main() {
    std::cout << compiler_uses_ascii() << " " << char_index('B') << "\n";
}

when executed outputs:

$ g++ 1.cpp -std=c++20 && ./a.out
1 1
$ g++ 1.cpp -fexec-charset=IBM-1047 -std=c++20 && ./a.out
0@1%
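A possible refinement, sketched under the assumption that char_index only ever needs uppercase letters: the expression c - 'A' does not require full ASCII, only that 'A' through 'Z' have consecutive values (which is not the case in EBCDIC code pages such as IBM-1047). The names uppercase_is_contiguous and upper_index are hypothetical and not part of the program above.

```cpp
// Hypothetical helper: true when 'A'..'Z' occupy 26 consecutive code
// values in the execution character set, which is all `c - 'A'` relies on.
constexpr bool uppercase_is_contiguous() {
    constexpr char letters[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    for (int i = 0; i < 26; ++i) {
        if (letters[i] - letters[0] != i) {
            return false;
        }
    }
    return true;
}

// Hypothetical variant of char_index built on the narrower check.
// Returns -1 for anything that is not an uppercase letter.
constexpr int upper_index(char c) {
    if constexpr (uppercase_is_contiguous()) {
        return (c >= 'A' && c <= 'Z') ? c - 'A' : -1;
    } else {
        // Fallback: linear search through the explicit alphabet.
        constexpr char letters[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
        for (int i = 0; i < 26; ++i) {
            if (letters[i] == c) {
                return i;
            }
        }
        return -1;
    }
}

static_assert(upper_index('A') == 0 && upper_index('Z') == 25);
```

The trade-off is that this variant says nothing about digits or punctuation, so it is only appropriate when letter indexing is the sole thing that depends on the encoding.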
KamilCuk answered Sep 19 '22 01:09