
How to get ncurses to output astral plane unicode characters

I have the following piece of extremely simple code, which is supposed to output (amongst other things), three unicode characters:

/*
 * To build:
 *   gcc -o curses curses.c -lncursesw
 *
 * Expected result: display these chars:
 *   http://www.fileformat.info/info/unicode/char/2603/index.htm  (snowman)
 *   http://www.fileformat.info/info/unicode/char/26c4/index.htm  (snowman without snow)
 *   http://www.fileformat.info/info/unicode/char/1f638/index.htm (grinning cat face with smiling eyes)
 *
 * Looks like ncurses is NOT able to display second and third char
 * (only the first one is OK...)
 */

#include <ncurses.h>
#include <stdio.h>
#include <locale.h>

int
main (int argc, char *argv[])
{
    WINDOW *stdscr;
    char buffer[] = {
        '<',
        0xE2, 0x98, 0x83,       // U+2603 : snowman: OK
        0xE2, 0x9B, 0x84,       // U+26C4 : snowman without snow: ERROR (space displayed)
        0xF0, 0x9F, 0x98, 0xB8, // U+1F638: grinning cat face: ERROR (space displayed)
        '>',
        '\0' };

    setlocale (LC_ALL, "");

    stdscr = initscr ();
    mvwprintw (stdscr, 0, 0, "%s", buffer);
    getch ();
    endwin ();

    /* output the buffer outside of ncurses */
    printf("%s\n",buffer);
    return 0;
}

The final printf outputs all the characters as I'd expect, "<☃⛄😸>", since I'm using a correctly configured locale, terminal emulator and font. However, the first part, which is supposed to output the same text using ncurses functions, doesn't work properly: only the first character (the snowman) appears, and the other two are rendered as spaces, giving "<☃  >".

I've read numerous google posts saying I also need to include

#define _XOPEN_SOURCE_EXTENDED 1

in the source - but doing so hasn't changed the output for me at all.

So - am I doing something supremely stupid here, or is ncurses broken when using some parts of the unicode space?

asked May 07 '14 19:05 by GodEater



1 Answer

It's not exactly that ncurses is broken. More like, glibc is broken. Or whatever implementation of libc you are using; I'm just assuming that it is glibc.

Unlike simple console output (i.e., printf), ncurses needs to know how wide every character is when it is printed because it needs to maintain its own model of what the screen looks like, and where the cursor is. Not all Unicode codepoints are 1 unit wide, even with a fixed-width font: many codepoints are zero units wide (combining accents, for example), and quite a few are two units wide (Han ideographs) [Note 1].

It turns out that there is a standard C library function, wcwidth, which takes a wchar_t and returns 0, 1, or 2 (or theoretically any integer, but afaik those are the only implemented widths) if the character is "printable", and -1 if the character is invalid or a control character. The wide-character-enabled version of ncurses uses wcwidth to predict how far the cursor will move after the character is printed. If wcwidth returns the error indication, ncurses substitutes a space.
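As a rough illustration of that bookkeeping, here is a minimal sketch of how a cursor advance could be predicted from wcwidth. The helper name predicted_columns is made up for this example and is not part of ncurses; it only mirrors the rule described above (a character with width -1 is counted as one column, the substituted space) and is not ncurses's actual implementation.

```c
#define _XOPEN_SOURCE 700   /* exposes wcwidth() in <wchar.h> on glibc */
#include <wchar.h>

/*
 * Hypothetical helper (not an ncurses function): estimate how many
 * screen columns a wide string would occupy under the rule described
 * above. wcwidth() is evaluated in the current locale.
 */
int predicted_columns(const wchar_t *s)
{
    int cols = 0;
    for (; *s != L'\0'; ++s) {
        int w = wcwidth(*s);
        /* -1 means "not printable in this locale": assume a single
         * space is drawn in its place, so count one column. */
        cols += (w < 0) ? 1 : w;
    }
    return cols;
}
```

Call setlocale(LC_ALL, "") before using it, otherwise the widths come from the default "C" locale rather than from your configured one.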

wcwidth reads the width from the WIDTH section of the locale's charmap, but that definition only provides the exceptions; any printable character without a defined width is assumed to have a width of 1. So wcwidth also needs to check to see if the character is printable, which is defined in the LC_CTYPE locale specification. That's the same data which drives the iswprint library function.

Unfortunately, there is no guarantee that the terminal emulator shares the same view of Unicode character data as the C library functions. And for characters whose actual display widths are different from the locale-configured width, ncurses will produce unexpected behaviour.

In this case, there's no problem with the width (the characters are all 1 unit wide, so the default is correct); the problem is that the characters actually exist in your console font and you want to use them, but they don't exist in glibc's character database, because that database is still based on Unicode 5.0. (In fact, the glibc bug report about this should itself be updated, because Unicode is now at 6.3, not 6.1.)

To help you see that, here's a tiny little program which dumps the configured ctype information for unicode codepoints [Note 2]:

#define _XOPEN_SOURCE 600
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wctype.h>
#include <wchar.h>

#define CONC_(x,y) x##y
#define IS(x) (CONC_(isw,x)(c)?#x" ":"")

int main(int argc, char** argv) {
  setlocale(LC_CTYPE,"");
  for (int i = 1; i < argc; ++i) {
    wint_t c = strtoul(argv[i], NULL, 16);
    printf("Code %04X: width %d %s%s%s%s%s%s%s%s%s%s%s%s\n", c, wcwidth(c),
           IS(alpha),IS(lower),IS(upper),IS(digit),IS(xdigit),IS(alnum),
           IS(punct),IS(graph),IS(blank),IS(space),IS(print),IS(cntrl));
  }
  return 0;
}

Compile it and you can look at your character data. It probably looks like this:

$ gcc -std=c11 -Wall -o wcinfo wcinfo.c
$ ./wcinfo 2603 26c4 1f638
Code 2603: width 1 punct graph print 
Code 26C4: width -1 
Code 1F638: width -1 
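The same check can be made from inside a program before any text is handed to ncurses. The following is only a sketch: first_unprintable is a hypothetical helper name, and it assumes (as described above) that a codepoint missing from the libc database is exactly one for which iswprint returns false.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: return the index of the first codepoint in s
 * that the current locale does not classify as printable, or -1 if
 * every codepoint is printable.
 */
long first_unprintable(const wchar_t *s)
{
    for (size_t i = 0; s[i] != L'\0'; ++i) {
        if (!iswprint((wint_t)s[i]))
            return (long)i;
    }
    return -1;
}
```

Called after setlocale(LC_ALL, "") on a string such as L"<\u2603\u26c4\U0001F638>", it should point at U+26C4 on a system whose character data looks like the dump above.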

So, what to do? You could wait for the glibc database to get updated, but I suspect that's not going to happen anytime soon. So if you really want to use those characters, you'll need to modify your own locale definitions.

If you have the same glibc installation as I do (and the locale files haven't changed for a while, so you probably do), then you'll find your locale files in /usr/share/i18n/locales. In the actual locale file, the LC_CTYPE section will include the directive copy "i18n", which means that the actual ctype configuration is in the file /usr/share/i18n/locales/i18n. You can then edit that file to make the appropriate changes. (Make a backup copy before you change the file, of course. And you'll need to run your editor with sudo because the file is only writable by root.)

First find the line which starts graph [Note 3], and then search forwards for U26 (line 716 in my configuration, fwiw). You'll find a line with an entry which looks like <U26A0>..<U26C3>;, which means that codepoints 26A0 through 26C3 are graphical (visible printing) characters. Expand that range as necessary. (I changed the 26C3 to 26C4 for a minimal test, but you might want to include more characters.) A few lines further down, you'll see the graph ranges for the second plane (Plane 1, U+10000 through U+1FFFF); add an appropriate entry. Again, being minimalist, I added a new line:

   <U0001F638>;/

but you'll probably want to include a range. (The trailing / is the continuation marker, by the way.)

Next, go down a couple more lines, and you'll find the print section. Make exactly the same changes.

Then you can regenerate your locale information by running:

$ sudo locale-gen

And then you can test:

$ ./wcinfo 2603 26c4 1f638
Code 2603: width 1 punct graph print 
Code 26C4: width 1 graph print 
Code 1F638: width 1 graph print 

Once you do that, your original ncurses program should produce the expected output.

By the way, you can use wide character strings with ncurses; you don't have to manually produce UTF-8 encodings:

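/* Link with -lncursesw, as in the question. */
#define _XOPEN_SOURCE_EXTENDED 1  /* expose the wide-character curses API */
#include <locale.h>
#include <ncurses.h>
#include <wchar.h>

int
main (void)
{
    /* Set the locale before initscr () so ncurses picks up the encoding. */
    setlocale (LC_ALL, "");
    const wchar_t* wstr = L"<\u2603\u26c4\U0001F638>";
    initscr ();  /* also sets the global stdscr; no local declaration needed */
    mvwaddwstr (stdscr, 0, 0, wstr);
    getch ();
    endwin ();
    return 0;
}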
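If the text arrives as multibyte (e.g. UTF-8) bytes at run time, as with the buffer in the question, it can be converted with the standard mbstowcs function rather than encoded by hand. This is a sketch only: to_wide is a hypothetical helper name, and it relies on the POSIX behaviour that mbstowcs with a null destination returns the required length.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert a multibyte string (interpreted in the
 * current locale's encoding) to a newly allocated wide string.
 * Returns NULL on an invalid byte sequence or if allocation fails.
 * The caller must free() the result.
 */
wchar_t *to_wide(const char *mb)
{
    /* With a null destination, POSIX mbstowcs() returns the number of
     * wide characters needed, not counting the terminator. */
    size_t n = mbstowcs(NULL, mb, 0);
    if (n == (size_t)-1)
        return NULL;

    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    mbstowcs(w, mb, n + 1);  /* n + 1 leaves room for the terminator */
    return w;
}
```

The result can be passed to mvwaddwstr like the literal above. The conversion depends on the locale, so setlocale(LC_ALL, "") has to run first.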

Notes

  1. For more information, see Wikipedia on halfwidth and fullwidth forms.

  2. It's a quick-and-dirty no-error-checking program, but it's good enough for what we need here. For production purposes, one would want a few more lines of code :)

  3. You might not need to fix the graph wctype; print might be sufficient. I didn't check. I did both because ncurses also sometimes needs to know whether characters are transparent, and it seemed safer to mark the character as visible, since it is.

answered Nov 05 '22 13:11 by rici
