
Emacs, xterm, mousepad, C, Unicode and UTF-8: Trying to make sense of it all

Disclaimer: My apologies for all the text below (for a single simple question), but I sincerely think that every bit of information is relevant to the question. I'd be happy to learn otherwise. I can only hope that, if successful, the question(s) and the answers may help others in Unicode madness. Here goes.

I have read the usual, highly regarded websites about UTF-8 (one in particular is very good for my purposes), as well as the classics, like those mentioned in similar questions on SO. However, I still don't know how to integrate it all in my virtual lab. I use Emacs with

;; Internationalization
(prefer-coding-system 'utf-8)
(setq locale-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
(set-keyboard-coding-system 'utf-8)
(set-selection-coding-system 'utf-8)

in my .emacs, xterm started with

LC_CTYPE=en_US.UTF-8 xterm -geometry 91x58 \
  -fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'

and my locale reads:

LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

My questions are the following (some of the answers may be the expected behavior of the application, but I still need to make sense of it, so bear with me):

Supposing the following C program:

#include <stdio.h>

int main(void) {
  int c;
  while((c=getc(stdin))!=EOF) {
    if(c!='\n') {
      printf("Character: %c, Integer: %d\n", c, c);
    }
  }
  return 0;
}

If I run this in my xterm I get:

€
Character: � Integer: 226
Character: �, Integer: 130
Character: �, Integer: 172

(In case they don't render here: the characters I get are white question marks inside black circles.) The integers are the decimal values of the 3 bytes needed to encode €, but I am not exactly sure why xterm does not display them properly.

Mousepad, for example, shows instead:

Character: â, Integer: 226
Character: ,, Integer: 130 (a comma, standing for U+0082 <control>, why?!)
Character: ¬, Integer: 172

Meanwhile, Emacs displays

Character: \342, Integer: 226
Character: \202, Integer: 130
Character: \254, Integer: 172

QUESTION: The most general question I can ask is: How do I get everything to print the same character? But I am certain there will be follow-ups.

Thanks again, and apologies for all the text.

Dervin Thunk asked Jul 17 '09 22:07



1 Answer

Ok, so your problem here is due to mixing old-school C library calls (getc, printf %c) and UTF-8. Your code is correctly reading the three bytes which make up '€' (226, 130 and 172 in decimal), but none of these values is a valid UTF-8 sequence on its own.

If you look at the UTF-8 encoding, byte values 0..127 are the encodings for the original US-ASCII character set. Values 128..255 (i.e. all of your bytes) only ever occur as part of a multibyte UTF-8 sequence, and so don't correspond to a valid UTF-8 character individually.

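In other words, the single byte 226 doesn't mean anything on its own (it is the lead byte of a 3-byte sequence, as expected). The printf call prints it as a single byte followed by other text, which is invalid UTF-8, so each program copes with the invalid value in its own way: xterm substitutes the replacement character, Mousepad appears to fall back to a single-byte encoding, and Emacs shows the raw bytes as octal escapes (\342 is 226 in octal).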

Assuming you just want to see which bytes a UTF-8 character is made of, I suggest you stick to the integer output you already have (or maybe use hex if that is more sensible). As your bytes above 127 aren't valid UTF-8 on their own, you're unlikely to get consistent results across different programs.

DaveR answered Sep 20 '22 22:09
