I'm writing a terminal (console) application that is supposed to wrap arbitrary unicode text.
Terminals are usually using a monospaced (fixed width) font, so to wrap a text, it's barely more than counting characters and watching whether a word fits into a line or not and act accordingly.
Problem is that there are fullwidth characters in the Unicode table that take up the width of 2 characters in a terminal.
Counting these would see 1 unicode character, but the printed character is 2 "normal" (halfwidth) characters wide, breaking the wrapping routine as it is not aware of chars that take up twice the width.
As an example, this is a fullwidth character (U+3004, the JIS symbol)
〄 12
It does not take up the full width of 2 characters here although it's preformatted, but it does use twice the width of a western character in a terminal.
To deal with this, I have to distinguish between fullwidth or halfwidth characters, but I cannot find a way to do so in C++. Is it really necessary to know all fullwidth characters in the unicode table to get around the problem?
You should use ICU u_getIntPropertyValue
with the UCHAR_EAST_ASIAN_WIDTH
property.
For example:
bool is_fullwidth(UChar32 c) {
int width = u_getIntPropertyValue(c, UCHAR_EAST_ASIAN_WIDTH);
return width == U_EA_FULLWIDTH || width == U_EA_WIDE;
}
Note that if your graphics library supports combining characters then you'll have to consider those as well when determining how many cells a sequence uses; for example e
followed by U+0301
COMBINING ACUTE ACCENT will only take up 1 cell.
There's no need to build tables, people from Unicode have already done that:
http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
The same code is used in terminal emulating software such as xterm
[1], konsole
[2] and quite likely others...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With