How does string work with non-ascii symbols while char does not?

I understand that char in C++ is just an integer type that stores ASCII symbols as numbers ranging from 0 to 127. The Scandinavian letters 'æ', 'ø', and 'å' are not among the 128 symbols in the ASCII table.

So naturally, when I try char ch1 = 'ø' I get a compiler error; however, string str = "øæå" works fine, even though a string makes use of chars, right?

Does string somehow switch over to Unicode?

asked Apr 25 '14 by That new guy



3 Answers

In C++ there is the source character set and the execution character set. The source character set is the set of characters you can use in your source code; it doesn't have to coincide with the characters available at runtime.

It's implementation-defined what happens if you use characters in your source code that aren't in the source character set. Apparently 'ø' is not in your compiler's source character set, otherwise you wouldn't have gotten an error; this means your compiler's documentation should explain what it does for both of these code samples. You will probably find that str does end up holding some sequence of bytes that forms a string.

To avoid this you can use escape sequences instead of embedding the characters directly in your source code, in this case '\xF8'. If you need characters that aren't in the execution character set either, you can use wchar_t and wstring.
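For instance (a minimal sketch; it assumes an execution character set in which 0xF8 encodes 'ø', such as Latin-1):

    #include <string>

    int main() {
        // Hex escape: 0xF8 is 'ø' only if the execution character set
        // is Latin-1 or similar.
        char c = '\xF8';

        // Wide characters and strings can represent characters that the
        // narrow execution character set cannot.
        wchar_t wc = L'\u00F8';                  // ø
        std::wstring ws = L"\u00F8\u00E6\u00E5"; // øæå

        (void)c; (void)wc; (void)ws; // silence unused-variable warnings
    }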

answered Oct 27 '22 by M.M


Compiling the source code char c = 'ø'; gives:

source_file.cpp:2:12: error: character too large for enclosing character literal type
  char c = '<U+00F8>';
           ^

What's happening here is that the compiler converts the character from the source encoding and determines that there's no representation of it in the execution encoding that fits inside a single char. (Note that the error has nothing to do with the initialization of c; it would happen with any such character literal.)

When you put such characters into a string literal rather than a character literal, however, the compiler's conversion from the source encoding to the execution encoding is perfectly happy to use a multi-byte representation of the character when the execution encoding is a multi-byte encoding such as UTF-8.
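A minimal sketch of the contrast (assuming the execution encoding is UTF-8, where 'ø' occupies two bytes):

    #include <cstdio>
    #include <cstring>

    int main() {
        // char c = 'ø';        // error: no single-byte representation

        const char* s = "\u00F8"; // fine: the compiler stores the
                                  // multi-byte form of ø in the string
        std::printf("%zu\n", std::strlen(s)); // prints 2 under UTF-8
    }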

To better understand what compilers do in this area you should start by reading clauses 2.3 [lex.charsets], 2.14.3 [lex.ccon], and 2.14.5 [lex.string] in the C++ standard.

answered Oct 27 '22 by bames53


What's likely happening here is that your source file is encoded as UTF-8 or some other multi-byte character encoding, and the compiler is simply treating it as a sequence of bytes. A single char can only be a single byte, but a string is perfectly happy to be as many bytes as are required.
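You can see this by printing the bytes of such a string (a sketch assuming both the source file and the execution encoding are UTF-8):

    #include <cstdio>
    #include <cstring>

    int main() {
        const char str[] = "øæå"; // three letters in the source file

        // Under UTF-8 each of these letters occupies two bytes.
        std::printf("bytes: %zu\n", std::strlen(str)); // prints 6, not 3

        for (const char* p = str; *p; ++p)
            std::printf("%02X ", static_cast<unsigned char>(*p)); // C3 B8 C3 A6 C3 A5
        std::printf("\n");
    }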

answered Oct 27 '22 by Mark Ransom