 

Is the char encoding the same across programming languages?

Tags: java, c++, python, c, char

A very easy (and kind of elegant) way to convert a char containing a lower-case letter into an int is to do the following:

int convertLowercaseCharLettertoInt(char letter) {
    return letter - 'a';
}

However, this code assumes that the char encoding follows the same ordering as the alphabet. Or, more generally, it assumes that char follows the ASCII encoding.

  • I know that Java char is UTF-16 while C char is ASCII. Although UTF-16 is not backward-compatible with ASCII, the ordering of the first 128 characters is the same in both. So is the ordering of the first 128 chars the same in all major languages such as C, C++, Java, C#, JavaScript and Python?
  • Is the method above a safe thing to do in general (assuming the input is sanitized, etc.)? Or is it better to use hash-map or long switch statement approaches? The hash-map approach is, I think, the most elegant way to solve this problem in the case of non-English alphabets. E.g. the Czech alphabet goes: a, á, b, c, č, d, ď, e, é, ě, f, g, h, ch, i, í, j, k, l, m, n, ň, o, ó, p, q, r, ř, s, š, t, ť, u, ú, ů, v, w, x, y, ý, z, ž. (A table-lookup sketch is shown right after this list.)
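
As a hedged illustration of that lookup idea (not part of the original question): in C, letters such as á or č are multi-byte sequences in UTF-8 and "ch" is a digraph, so a simple alternative to a hash map is a table of UTF-8 strings searched by index. The function name and table below are only illustrative, and the sketch assumes both the source file and the input use UTF-8.

#include <stdio.h>
#include <string.h>

/* Czech alphabet as UTF-8 strings; several letters are multi-byte and
   "ch" is a two-character digraph, so letter - 'a' arithmetic cannot
   work here. */
static const char *czech_alphabet[] = {
    "a", "á", "b", "c", "č", "d", "ď", "e", "é", "ě", "f", "g", "h", "ch",
    "i", "í", "j", "k", "l", "m", "n", "ň", "o", "ó", "p", "q", "r", "ř",
    "s", "š", "t", "ť", "u", "ú", "ů", "v", "w", "x", "y", "ý", "z", "ž"
};

/* Hypothetical helper: returns the 0-based position in the Czech
   alphabet, or -1 if the string is not a Czech letter. */
int convertCzechLetterToInt(const char *letter) {
    size_t n = sizeof czech_alphabet / sizeof czech_alphabet[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(czech_alphabet[i], letter) == 0)
            return (int)i;
    return -1;
}

int main(void) {
    printf("%d\n", convertCzechLetterToInt("č"));  /* prints 4 */
    printf("%d\n", convertCzechLetterToInt("ch")); /* prints 13 */
    return 0;
}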
Asked Aug 28 '15 by Augustin


People also ask

Is the ASCII value the same for all languages?

Yes. ASCII is a standard, so the ASCII value of a character is the same in every programming language and on every system that uses ASCII.

Does UTF-8 support all languages?

UTF-8 supports any Unicode character, which pragmatically means any natural language (Coptic, Sinhala, Phoenician, Cherokee, etc.), as well as many non-spoken languages (music notation, mathematical symbols, APL).

Does Java use ASCII encoding?

Java actually uses Unicode, which includes ASCII and other characters from languages around the world.

What is the character encoding standard used in Java language?

The native character encoding of the Java programming language is UTF-16. A charset in the Java platform therefore defines a mapping between sequences of sixteen-bit UTF-16 code units (that is, sequences of chars) and sequences of bytes.


2 Answers

This has less to do with the programming language and more to do with the system's underlying character set. ASCII and all variants of Unicode will behave as you expect: 'a'...'z' are 26 consecutive code points. EBCDIC will not, so your trick will fail on an IBM/360 in most languages.

Java and Python (and perhaps other languages) mandate Unicode encoding regardless of the underlying platform, so your trick will work there as well, assuming you can find a conforming Java implementation for your IBM mainframe.
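
As a small hedged addition (not part of the original answer): the same assumption can also be checked at run time in C. The helper name below is only illustrative.

#include <stdio.h>

/* Returns 1 if 'a'..'z' are 26 consecutive code points in the execution
   character set (true for ASCII and Unicode, false for EBCDIC). */
int lowercase_letters_are_consecutive(void) {
    const char *alphabet = "abcdefghijklmnopqrstuvwxyz";
    for (int i = 1; i < 26; i++)
        if (alphabet[i] != alphabet[i - 1] + 1)
            return 0;
    return 1;
}

int main(void) {
    puts(lowercase_letters_are_consecutive()
             ? "letter - 'a' is safe on this system"
             : "letter - 'a' is NOT safe on this system");
    return 0;
}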

Answered Sep 29 '22 by Lee Daniel Crocker


In C, the compiler can detect the problem at compile time:

#include <assert.h>
#include <string.h>

#if 'a'+1=='b' && 'b'+1=='c' && 'c'+1=='d' && 'd'+1=='e' && 'e'+1=='f' \
  && 'f'+1=='g' && 'g'+1=='h' && 'h'+1=='i' && 'i'+1=='j' && 'j'+1=='k'\
  && 'k'+1=='l' && 'l'+1=='m' && 'm'+1=='n' && 'n'+1=='o' && 'o'+1=='p'\
  && 'p'+1=='q' && 'q'+1=='r' && 'r'+1=='s' && 's'+1=='t' && 't'+1=='u'\
  && 'u'+1=='v' && 'v'+1=='w' && 'w'+1=='x' && 'x'+1=='y' && 'y'+1=='z'

/* Letters are consecutive (e.g. ASCII/Unicode): simple arithmetic works. */
int convertLowercaseCharLettertoInt(char letter) {
  return letter - 'a';
}
#else
  /* Letters are not consecutive (e.g. EBCDIC): fall back to a table search. */
  int convertLowercaseCharLettertoInt(char letter) {
    static const char lowercase[] = "abcdefghijklmnopqrstuvwxyz";
    const char *occurrence = strchr(lowercase, letter);
    assert(letter && occurrence);
    return occurrence - lowercase;
  }
#endif

See also @John Bode's code.


Note: The following works with all C character encodings

#include <stdlib.h>

/* strtol() in base 36 maps 'a'-'z' (or 'A'-'Z') to the values 10-35
   regardless of the character encoding, so subtracting 10 yields 0-25. */
int convertLowercaseOrUppercaseCharLettertoInt(char letter) {
  char s[2] = { letter, '\0' };
  return strtol(s, 0, 36) - 10;
}
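
Why this is encoding-independent: the C standard defines strtol so that the letters a through z (or A through Z) stand for the values 10 through 35 when parsing in base 36, so the mapping does not depend on the letters having contiguous code points. A minimal usage sketch follows (my addition, repeating the function above so it compiles on its own):

#include <stdio.h>
#include <stdlib.h>

int convertLowercaseOrUppercaseCharLettertoInt(char letter) {
  char s[2] = { letter, '\0' };
  return strtol(s, 0, 36) - 10;
}

int main(void) {
    /* Letters map to 0..25 in either case; a non-letter such as '?' is
       not converted by strtol(), which then returns 0, giving -10. */
    printf("%d %d %d %d\n",
           convertLowercaseOrUppercaseCharLettertoInt('a'),   /* 0 */
           convertLowercaseOrUppercaseCharLettertoInt('Z'),   /* 25 */
           convertLowercaseOrUppercaseCharLettertoInt('z'),   /* 25 */
           convertLowercaseOrUppercaseCharLettertoInt('?'));  /* -10 */
    return 0;
}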
Answered Sep 29 '22 by chux - Reinstate Monica