I'm working on a terminal-based program with Unicode support. In certain cases I need to determine how many terminal columns a string will occupy before I print it. Unfortunately, some characters are two columns wide (Chinese, etc.), but I found this answer, which indicates that a good way to detect fullwidth characters is to call u_getIntPropertyValue() from the ICU library.
Now I'm trying to parse the characters of my UTF-8 string and pass them to this function. The problem is that u_getIntPropertyValue() expects a UTF-32 code point.
What is the best way to obtain this from a UTF-8 string? I'm currently trying to do this with boost::locale (used elsewhere in my program), but I'm having trouble getting a clean conversion. The UTF-32 strings that come from boost::locale are prepended with a byte order mark (a zero-width character that indicates byte order). Obviously I can just skip the first four bytes of the string, but is there a cleaner way to do this?
Here is my current ugly solution:
inline size_t utf8PrintableSize(const std::string &str, std::locale loc)
{
    namespace ba = boost::locale::boundary;
    ba::ssegment_index map(ba::character, str.begin(), str.end(), loc);
    size_t widthCount = 0;
    for (ba::ssegment_index::iterator it = map.begin(); it != map.end(); ++it)
    {
        ++widthCount;
        std::string utf32Char = boost::locale::conv::from_utf(it->str(), std::string("utf-32"));
        UChar32 utf32Codepoint = 0;
        memcpy(&utf32Codepoint, utf32Char.c_str()+4, sizeof(UChar32));
        int width = u_getIntPropertyValue(utf32Codepoint, UCHAR_EAST_ASIAN_WIDTH);
        if ((width == U_EA_FULLWIDTH) || (width == U_EA_WIDE))
        {
            ++widthCount;
        }
    }
    return widthCount;
}
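A note on the memcpy above: it skips four bytes unconditionally and copies in the host's byte order, so it silently depends on the converter always emitting a byte order mark and using native endianness. Below is a minimal sketch of a more defensive read, assuming the input is a buffer of raw UTF-32 bytes. The name firstCodePointUtf32 is hypothetical (it is not a Boost or ICU function), and treating BOM-less input as big-endian is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Boost or ICU): reads the first code point
// from a buffer of raw UTF-32 bytes, honouring an optional byte order mark.
// Returns U+FFFD if the buffer does not hold a complete code unit.
inline char32_t firstCodePointUtf32(const std::string &bytes)
{
    // Assemble one 32-bit unit from four bytes in the requested byte order.
    auto unitAt = [&bytes](std::size_t pos, bool bigEndian) -> char32_t {
        char32_t value = 0;
        for (std::size_t i = 0; i < 4; ++i)
        {
            const std::size_t idx = bigEndian ? pos + i : pos + 3 - i;
            value = (value << 8) | static_cast<unsigned char>(bytes[idx]);
        }
        return value;
    };

    if (bytes.size() < 4)
    {
        return 0xFFFD;
    }

    bool bigEndian = true;   // assumption of this sketch when no BOM is present
    std::size_t pos = 0;
    if (unitAt(0, false) == 0xFEFF)
    {
        bigEndian = false;   // little-endian BOM: FF FE 00 00
        pos = 4;
    }
    else if (unitAt(0, true) == 0xFEFF)
    {
        pos = 4;             // big-endian BOM: 00 00 FE FF
    }

    if (bytes.size() < pos + 4)
    {
        return 0xFFFD;       // a BOM with nothing after it
    }
    return unitAt(pos, bigEndian);
}
```

The result could be passed to u_getIntPropertyValue() in place of utf32Codepoint. This only tidies up the BOM handling; it does not remove the round trip through a byte string.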
@n.m was correct: there is an easy way to do this with ICU directly. Updated code is below. I suspect I can probably just use UnicodeString and bypass boost::locale entirely in this scenario.
#include <locale>
#include <string>
#include <boost/locale.hpp>
#include <unicode/uchar.h>
#include <unicode/unistr.h>
#include <unicode/stringpiece.h>

inline size_t utf8PrintableSize(const std::string &str, std::locale loc)
{
    namespace ba = boost::locale::boundary;
    ba::ssegment_index map(ba::character, str.begin(), str.end(), loc);
    size_t widthCount = 0;
    for (ba::ssegment_index::iterator it = map.begin(); it != map.end(); ++it)
    {
        ++widthCount;
        // Note: Some Unicode characters are 'full width' and consume more than one
        //       column on output. We increment widthCount one extra time for
        //       these characters to ensure that space is properly allocated.
        icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(it->str()));
        UChar32 codePoint = ucs.char32At(0);
        int width = u_getIntPropertyValue(codePoint, UCHAR_EAST_ASIAN_WIDTH);
        if ((width == U_EA_FULLWIDTH) || (width == U_EA_WIDE))
        {
            ++widthCount;
        }
    }
    return widthCount;
}
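For comparison, the UTF-8 to code point step can also be done without any library. The following is a minimal sketch, not a replacement for ICU: decodeUtf8 is a hypothetical helper name, and replacing each malformed byte with U+FFFD is a choice made for this sketch. ICU also ships its own UTF-8 iteration macros (such as U8_NEXT in unicode/utf8.h), which are likely the better option when ICU is already linked.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not an ICU or Boost API): decodes a UTF-8 string into
// a sequence of code points. Each malformed byte is replaced by U+FFFD.
inline std::u32string decodeUtf8(const std::string &str)
{
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes;
    // anything below that is an overlong encoding.
    static const char32_t minValue[] = {0x0, 0x80, 0x800, 0x10000};

    std::u32string out;
    std::size_t i = 0;
    while (i < str.size())
    {
        const unsigned char lead = static_cast<unsigned char>(str[i]);
        std::size_t extra = 0;   // number of continuation bytes expected
        char32_t cp = 0;
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }   // stray or invalid lead byte

        bool ok = (i + extra < str.size());              // sequence must not be truncated
        for (std::size_t k = 1; ok && k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(str[i + k]);
            ok = ((c & 0xC0) == 0x80);                   // must be a 10xxxxxx byte
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        if (ok && (cp < minValue[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
        {
            ok = false;
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Each element of the result can be passed to u_getIntPropertyValue() the same way codePoint is above. Note that this walks code points, not the character boundaries that ba::character produces, so a base character followed by combining marks would be counted once per code point unless combining marks are handled separately.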