Disclaimer: My apologies for all the text below (for a single simple question), but I sincerely think that every bit of information is relevant to the question. I'd be happy to learn otherwise. I can only hope that, if successful, the question(s) and the answers may help others in Unicode madness. Here goes.
I have read the usual highly regarded websites about UTF-8 (this one is particularly good for my purposes), and I've read the classics too, like those mentioned in other similar questions on SO. However, I still don't know how to integrate it all in my virtual lab. I use Emacs with
;; Internationalization
(prefer-coding-system 'utf-8)
(setq locale-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
(set-keyboard-coding-system 'utf-8)
(set-selection-coding-system 'utf-8)
in my .emacs, xterm started with
LC_CTYPE=en_US.UTF-8 xterm -geometry 91x58 \
    -fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'
and my locale reads:
LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
My questions are the following (some of the answers may be the expected behavior of the application, but I still need to make sense of it, so bear with me):
Supposing the following C program:
#include <stdio.h>

int main(void) {
    int c;
    while ((c = getc(stdin)) != EOF) {
        if (c != '\n') {
            printf("Character: %c, Integer: %d\n", c, c);
        }
    }
    return 0;
}
If I run this in my xterm I get:
€
Character: � Integer: 226
Character: �, Integer: 130
Character: �, Integer: 172
(In case they do not render here: the characters I get are a white question mark within a black circle.) The ints are the decimal values of the 3 bytes needed to encode € in UTF-8, but I am not exactly sure why xterm does not display them properly.
Mousepad, for example, instead prints
Character: â, Integer: 226
Character: ,, Integer: 130 (a comma, standing for U+0082 <control>, why?!)
Character: ¬, Integer: 172
Meanwhile, Emacs displays
Character: \342, Integer: 226
Character: \202, Integer: 130
Character: \254, Integer: 172
QUESTION: The most general question I can ask is: How do I get everything to print the same character? But I am certain there will be follow-ups.
Thanks again, and apologies for all the text.
OK, so your problem here comes from mixing byte-oriented C library calls (getc, printf with %c) with UTF-8 text. Your code correctly reads the three bytes that make up '€' (226, 130 and 172 in decimal), but none of these values is, on its own, a valid UTF-8 encoded character.
If you look at the UTF-8 encoding, the byte values 0..127 are the encodings of the original US-ASCII character set. Byte values 128 and above (i.e. all of your bytes) only ever occur as part of a multibyte sequence, so they don't correspond to a valid UTF-8 character individually.
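In other words, the single byte 226 doesn't mean anything on its own (it is the lead byte of a 3-byte sequence, as expected). The printf call writes it out as a single byte followed by other text, which is invalid UTF-8, so each program copes with the invalid value in its own way. xterm substitutes the replacement character, Emacs shows the raw bytes as octal escapes (\342 is 226 in octal), and Mousepad appears to fall back to a legacy 8-bit encoding; in Windows-1252, for instance, byte 130 is a low quotation mark that looks like a comma.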
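To make the byte ranges concrete, here is a minimal sketch that classifies a single byte by the role it can play in a UTF-8 stream. The names utf8_seq_len and describe_byte are made up for this illustration and are not part of any library.

```c
#include <stdio.h>

/* Hypothetical helper (the name is an assumption, not from the post).
 * Returns the length of the sequence that a lead byte starts (1 to 4),
 * 0 for a continuation byte, and -1 for a byte that never occurs in
 * valid UTF-8. */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;   /* 0xxxxxxx: plain ASCII               */
    if (b < 0xC0) return 0;   /* 10xxxxxx: continuation byte         */
    if (b < 0xC2) return -1;  /* 0xC0, 0xC1: overlong, never valid   */
    if (b < 0xE0) return 2;   /* 110xxxxx: lead of a 2-byte sequence */
    if (b < 0xF0) return 3;   /* 1110xxxx: lead of a 3-byte sequence */
    if (b < 0xF5) return 4;   /* 11110xxx: lead of a 4-byte sequence */
    return -1;                /* 0xF5..0xFF: never valid             */
}

/* Print one line describing the role of a byte. */
void describe_byte(unsigned char b)
{
    int len = utf8_seq_len(b);
    if (len > 0)
        printf("%3d (0x%02X): starts a %d-byte sequence\n", b, b, len);
    else if (len == 0)
        printf("%3d (0x%02X): continuation byte\n", b, b);
    else
        printf("%3d (0x%02X): never valid in UTF-8\n", b, b);
}
```

Calling describe_byte on 226, 130 and 172 in turn reports one 3-byte lead followed by two continuation bytes, which is why none of them can be displayed as a character by itself.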
Assuming you just want to see which bytes a UTF-8 character is made of, I suggest you stick with the integer output you already have (or use hex if that is more convenient). Since your bytes above 127 aren't valid UTF-8 on their own, you're unlikely to get consistent results across different programs.
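If you also want to know which character a group of bytes stands for, the bytes have to be combined before anything is printed. The following is a rough sketch of that step, not a complete validator: utf8_decode is a hypothetical helper written for this answer, and it deliberately skips some checks (overlong 3- and 4-byte forms, surrogates, values above U+10FFFF).

```c
#include <stddef.h>

/* Hypothetical helper (not a library function): decode one UTF-8
 * sequence starting at s, looking at no more than n bytes.
 * On success, stores the code point in *cp and returns the number of
 * bytes consumed. Returns 0 if the input is truncated or malformed. */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    size_t len, i;
    unsigned long c;

    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                        /* ASCII */
        len = 1; c = s[0];
    } else if (s[0] >= 0xC2 && s[0] < 0xE0) { /* 2-byte lead */
        len = 2; c = s[0] & 0x1F;
    } else if (s[0] >= 0xE0 && s[0] < 0xF0) { /* 3-byte lead */
        len = 3; c = s[0] & 0x0F;
    } else if (s[0] >= 0xF0 && s[0] < 0xF5) { /* 4-byte lead */
        len = 4; c = s[0] & 0x07;
    } else {
        return 0;                             /* continuation or invalid byte */
    }

    if (n < len)
        return 0;                             /* sequence is cut short */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                         /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);         /* append the 6 payload bits */
    }

    *cp = c;
    return len;
}
```

Fed the three bytes 226, 130, 172, this consumes all three and yields the code point 0x20AC, i.e. U+20AC EURO SIGN. Starting in the middle of the sequence (at 130) fails, which mirrors what the terminal is faced with when the bytes are written out one at a time with other text in between.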